TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches, in 3-bit (turbo3) and 4-bit (turbo4) variants with fast Walsh-Hadamard transform (FWHT) rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
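
To make the pipeline named above concrete, here is a minimal pure-Python sketch of rotate-then-quantize with a Lloyd-Max codebook. It is an illustration under stated assumptions, not the project's kernel code: the function names (fwht, lloyd_max, quantize, dequantize) are hypothetical, the codebook is fitted to synthetic Gaussian samples, and "norm correction" is interpreted here as rescaling the reconstruction to the stored L2 norm, which may differ from what turbo3/turbo4 actually do.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda k: abs(value - codebook[k]))

def lloyd_max(samples, levels=8, iters=20):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and cell-mean update."""
    s = sorted(samples)
    # Initialise centroids at evenly spaced quantiles of the samples.
    cents = [s[(2 * k + 1) * len(s) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for v in s:
            k = nearest(cents, v)
            sums[k] += v
            counts[k] += 1
        cents = [sums[k] / counts[k] if counts[k] else cents[k] for k in range(levels)]
    return sorted(cents)

def quantize(vec, codebook):
    """Store the L2 norm plus one codebook index per rotated coordinate."""
    norm = math.sqrt(sum(v * v for v in vec))
    rotated = fwht([v / norm for v in vec])
    scale = math.sqrt(len(vec))  # rotated unit vector has coordinate RMS 1/sqrt(n)
    return norm, [nearest(codebook, r * scale) for r in rotated]

def dequantize(norm, idx, codebook):
    """Look up centroids, rescale to the stored norm (assumed norm correction), rotate back."""
    rot = [codebook[i] for i in idx]
    rnorm = math.sqrt(sum(r * r for r in rot))
    rot = [r * norm / rnorm for r in rot]
    return fwht(rot)  # the orthonormal FWHT is its own inverse

random.seed(0)
codebook = lloyd_max([random.gauss(0.0, 1.0) for _ in range(2000)], levels=8)  # 3-bit
vec = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one head_dim=128 row
norm, idx = quantize(vec, codebook)
approx = dequantize(norm, idx, codebook)
```

Real implementations typically pack the indices into bit fields and fuse the lookup into the attention kernel; the sketch keeps them as plain lists so the steps stay visible.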

Created by @buun on 2026-03-27T17:28:26Z
Fork Details
Owner no_stp_on_snek
GPU Apple Silicon (128 GB VRAM)
Model claude-opus-4
Created 1mo ago
Last push 1mo ago
Experiments
ID Title Result Metrics Date
EXP-0001 turbo3 baseline (Apple Silicon, MoE, head_dim=128) success
compression_ratio 4.6 prefill_tok_s 2747 prefill_ratio_vs_q8 1.02 decode_tok_s_2k 78.6 decode_ratio_vs_q8_2k 0.987 decode_tok_s_8k 72.1 decode_ratio_vs_q8_8k 0.995 decode_tok_s_32k 57.7 decode_ratio_vs_q8_32k 0.93 perplexity 6.176
1mo ago
EXP-0002 turbo3 baseline (Apple Silicon, Dense, head_dim=128) success
compression_ratio 4.6 perplexity 5.445
1mo ago
EXP-0003 KL divergence vs f16 (Apple Silicon, MoE + Dense) neutral
kld_moe_turbo3 0.016145 kld_moe_q4_0 0.008091 kld_moe_q8_0 0.001549 kld_dense_turbo3 0.0099 kld_dense_q4_0 0.002741 kld_dense_q8_0 1.8e-05 same_top_p_moe_turbo3 94.31 same_top_p_dense_turbo3 95.98
1mo ago
EXP-0004 Sparse V dequant ON/OFF (Apple Silicon, MoE) success
decode_speedup_32k 1.228 ppl_sparse_v_on 6.176 ppl_sparse_v_off 6.176 niah_sparse_v 9 niah_sparse_v_total 9 niah_q8_0 7 niah_q8_0_total 9 skip_rate_512 0.091 skip_rate_4k 0.284 skip_rate_32k 0.9
1mo ago
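
The skip_rate and threshold metrics in EXP-0004 and EXP-0005 suggest that "Sparse V dequant" skips dequantizing value rows whose attention weight is negligible. The sketch below illustrates that idea under this assumption; the function names, the toy 2-bit codebook, and the 1e-6 threshold are hypothetical choices for illustration, not the project's kernel.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_v_sum(weights, v_codes, codebook, threshold=0.0):
    """Accumulate sum_t w_t * V_t, skipping dequant for rows with w_t < threshold."""
    dim = len(v_codes[0])
    out = [0.0] * dim
    skipped = 0
    for w, codes in zip(weights, v_codes):
        if w < threshold:
            skipped += 1  # row contributes almost nothing, so never dequantize it
            continue
        row = [codebook[c] for c in codes]  # dequantize only when needed
        for d in range(dim):
            out[d] += w * row[d]
    return out, skipped / len(weights)

codebook = [-1.5, -0.5, 0.5, 1.5]          # toy 2-bit codebook (hypothetical)
scores = [2.0, 1.0, 0.0] + [-30.0] * 5     # three relevant positions, five negligible
v_codes = [[(t + d) % 4 for d in range(4)] for t in range(len(scores))]
weights = softmax(scores)

dense, _ = weighted_v_sum(weights, v_codes, codebook)
sparse, skip_rate = weighted_v_sum(weights, v_codes, codebook, threshold=1e-6)
max_abs_diff = max(abs(a - b) for a, b in zip(dense, sparse))
```

Because softmax mass concentrates on a small share of positions as context grows, the fraction of skippable rows would be expected to rise with context length, which is the trend the skip_rate_512, skip_rate_4k, and skip_rate_32k metrics report.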
EXP-0005 Sparse V threshold ablation (Apple Silicon, MoE) neutral
ppl_1e4 6.1756 ppl_1e5 6.1756 ppl_1e6 6.1756 ppl_1e7 6.1756 ppl_1e8 6.1756
1mo ago
EXP-0006 Sparse V on q8_0 (Apple Silicon, generality test) success
decode_speedup 1.05 ppl_change 0.0 niah_change 0.0
1mo ago
EXP-0021 0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE) success
kv_savings_pct 30.9 niah_score 5 niah_total 5 prefill_tok_s_min 8238 prefill_tok_s_max 9684 decode_tok_s_short 131 decode_tok_s_131k 98 kv_cache_mb_tq_131k 521.9 kv_cache_mb_baseline_131k 755.7 full_attn_layers_compression 4.4 cosine_sim_3bit_keys 1.0 cosine_sim_2bit_values 0.94 cosine_sim_4bit_values 0.997
1mo ago
EXP-0022 signalnine CUDA PR #3 (RTX 5090 Blackwell) success
decode_tok_s_f16 95.4 decode_tok_s_q8_0 95.7 decode_tok_s_turbo3 94.0 decode_ratio_vs_f16 0.985 compression_ratio 3.47
1mo ago
EXP-0023 dan-and Madreag CUDA fork (4x RTX 3080) negative
decode_tg128_0k_q8 62.0 decode_tg128_0k_turbo3 48.93 decode_ratio_0k 0.79 decode_tg128_4k_q8 58.47 decode_tg128_4k_turbo3 36.4 decode_ratio_4k 0.62 decode_tg128_8k_q8 55.67 decode_tg128_8k_turbo3 28.35 decode_ratio_8k 0.51 decode_tg128_16k_q8 49.39 decode_tg128_16k_turbo3 19.75 decode_ratio_16k 0.4 decode_tg128_204k_q8 28.13 decode_tg128_204k_turbo3 5.38 decode_ratio_204k 0.19 prefill_ratio_0k 0.96 prefill_ratio_16k 1.0 prefill_ratio_204k 1.02 kv_mem_q8_total_mib 2948 kv_mem_turbo3_total_mib 1361 kv_compression_ratio 2.17
1mo ago
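
The EXP-0023 ratios can be recomputed from the raw throughput and memory figures listed in that entry. The short check below assumes each decode_ratio is turbo3 tok/s divided by q8_0 tok/s at the same context depth, and that kv_compression_ratio is q8_0 MiB divided by turbo3 MiB; both readings reproduce the reported values.

```python
# (q8_0 tok/s, turbo3 tok/s) per context depth, copied from EXP-0023.
decode_tg128 = {
    "0k": (62.0, 48.93),
    "4k": (58.47, 36.4),
    "8k": (55.67, 28.35),
    "16k": (49.39, 19.75),
    "204k": (28.13, 5.38),
}

# Assumed definition: turbo3 throughput relative to q8_0 at the same depth.
decode_ratio = {ctx: round(t3 / q8, 2) for ctx, (q8, t3) in decode_tg128.items()}

# Assumed definition: q8_0 KV memory divided by turbo3 KV memory.
kv_compression_ratio = round(2948 / 1361, 2)

for ctx, ratio in decode_ratio.items():
    print(f"decode_ratio_{ctx} = {ratio}")
print(f"kv_compression_ratio = {kv_compression_ratio}")
```

The ratio falls steadily as context grows, which is consistent with the entry's negative result: on this fork the decode slowdown relative to q8_0 widens with context depth while prefill stays near parity.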
EXP-0007 Dequant optimization — 4-mag LUT + XOR sign (Apple Silicon, M2 Pro) success
decode_tok_s_8k 15.1 decode_ratio_vs_q8 0.69 vs_ceiling_pct 62 speedup_vs_baseline 1.38
1mo ago
EXP-0008 Dequant optimization — batched byte extract 8-LUT (FAILED) failure
decode_tok_s_8k 13.7 vs_ceiling_pct 56
1mo ago
EXP-0009 Dequant optimization — inline dequant in FA loop (FAILED) failure
decode_tok_s_8k 13.5 vs_ceiling_pct 55
1mo ago
EXP-0010 Dequant optimization — deferred norm multiply (FAILED) failure
decode_tok_s_8k 12.9 vs_ceiling_pct 53
1mo ago
EXP-0011 Dequant optimization — 2-pair half2 + ternary (FAILED) failure
decode_tok_s_8k 12.0 vs_ceiling_pct 49
1mo ago
EXP-0012 Dequant optimization — select chain, zero LUT (FAILED) failure
decode_tok_s_8k 11.9 vs_ceiling_pct 49
1mo ago
EXP-0013 Dequant optimization — bit-arithmetic mul+add (FAILED) failure
decode_tok_s_8k 11.6 vs_ceiling_pct 47
1mo ago
EXP-0014 Dequant optimization — FMA branchless (FAILED) failure
decode_tok_s_8k 11.4 vs_ceiling_pct 47
1mo ago
EXP-0015 Dequant optimization — named-reg ternary select (FAILED) failure
decode_tok_s_8k 10.3 vs_ceiling_pct 42
1mo ago
EXP-0016 Dequant optimization — non-vec FA kernel (FAILED) failure
decode_tok_s_8k 10.2 vs_ceiling_pct 42
1mo ago
EXP-0017 Dequant optimization — simd_shuffle cross-lane (FAILED) failure
decode_tok_s_8k 14.7 vs_ceiling_pct 60
1mo ago
EXP-0018 Dequant optimization — fused block dot per-centroid Q accum (FAILED) negative
decode_tok_s_8k 8.1 vs_ceiling_pct 33
1mo ago
EXP-0019 Dequant optimization — 8-LUT baseline (reference) baseline
decode_tok_s_8k 10.95 decode_ratio_vs_q8 0.5 vs_ceiling_pct 45 ceiling_tok_s 24.5
1mo ago
EXP-0020 No-dequant ceiling measurement (Apple Silicon, M2 Pro) baseline
decode_tok_s_8k 24.5 decode_ratio_vs_q8 1.12
1mo ago
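
The vs_ceiling_pct values in the dequant optimization series (EXP-0007 through EXP-0019) are consistent with dividing each variant's decode_tok_s_8k by the 24.5 tok/s no-dequant ceiling from EXP-0020. The check below assumes that definition, and assumes speedup_vs_baseline is measured against the 8-LUT reference in EXP-0019; the variant labels are shorthand taken from the experiment titles.

```python
CEILING_TOK_S = 24.5    # EXP-0020: no-dequant ceiling at 8k context
BASELINE_TOK_S = 10.95  # EXP-0019: 8-LUT reference

# decode_tok_s_8k per variant, copied from the entries above.
decode_tok_s_8k = {
    "4-mag LUT + XOR sign": 15.1,
    "simd_shuffle cross-lane": 14.7,
    "batched byte extract 8-LUT": 13.7,
    "inline dequant in FA loop": 13.5,
    "deferred norm multiply": 12.9,
    "8-LUT baseline": 10.95,
    "fused block dot": 8.1,
}

def vs_ceiling_pct(tok_s):
    """Share of the no-dequant ceiling reached, as a whole percent (assumed definition)."""
    return round(100 * tok_s / CEILING_TOK_S)

pct = {name: vs_ceiling_pct(v) for name, v in decode_tok_s_8k.items()}
speedup_vs_baseline = round(decode_tok_s_8k["4-mag LUT + XOR sign"] / BASELINE_TOK_S, 2)

for name, p in pct.items():
    print(f"{name}: {p}% of ceiling")
print(f"best speedup vs baseline: {speedup_vs_baseline}x")
```

Read this way, the best variant (EXP-0007) recovers 62% of the throughput that would be available if dequantization were free, so roughly 38% of the ceiling is still being spent on dequant work at 8k context.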
EXP-0027 RotorQuant rotation speed comparison success
speedup_cuda_1k 11 speedup_cuda_16k 19 speedup_metal_1k 1.6 speedup_metal_65k 31.3 parameter_reduction 44
1mo ago
EXP-0028 NIAH multi-key validation (3 distractors, Apple Silicon) success
retrieval_pct 100 contexts_tested [2048, 4096, 8192, 16384, 32768]
1mo ago
EXP-0025 Dejan AI Triton kernel — MSE-only 2-bit (RTX 4090) success
character_identical_to_fp16 true
1mo ago
EXP-0024 MLX implementation NIAH (@Prince_Canuma, Apple Silicon) success
niah_2_5bit_score 6 niah_2_5bit_total 6 niah_3_5bit_score 6 niah_3_5bit_total 6
1mo ago
EXP-0026 Rotation Gaussianization validation (real KV tensors) success
raw_kv_kurtosis 900 post_rotation_kurtosis 2.9 post_rotation_std 0.088388 expected_std 0.088388 std_ratio 1.0
1mo ago
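
The expected_std of 0.088388 in EXP-0026 coincides with 1/sqrt(128), the coordinate RMS of any unit-norm vector at head_dim=128, since an orthonormal rotation preserves the L2 norm. The sketch below illustrates the effect on synthetic heavy-tailed data rather than real KV tensors. The random sign flip before the FWHT, the cubed-Gaussian input, and the non-excess kurtosis definition (Gaussian = 3) are all assumptions made for illustration.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def kurtosis(xs):
    """Non-excess kurtosis m4 / m2^2, which is 3 for a Gaussian."""
    mean = sum(xs) / len(xs)
    m2 = sum((v - mean) ** 2 for v in xs) / len(xs)
    m4 = sum((v - mean) ** 4 for v in xs) / len(xs)
    return m4 / (m2 * m2)

random.seed(0)
head_dim = 128

# Synthetic heavy-tailed row: cubing Gaussian samples produces a few large outliers.
raw = [random.gauss(0.0, 1.0) ** 3 for _ in range(head_dim)]
norm = math.sqrt(sum(v * v for v in raw))
unit = [v / norm for v in raw]

# Random sign flip followed by the FWHT acts as a cheap randomized rotation.
signs = [random.choice((-1.0, 1.0)) for _ in range(head_dim)]
rotated = fwht([s * v for s, v in zip(signs, unit)])

rms = math.sqrt(sum(v * v for v in rotated) / head_dim)
print("coordinate RMS after rotation:", rms)
print("1/sqrt(head_dim):             ", 1.0 / math.sqrt(head_dim))
print("kurtosis before rotation:", kurtosis(unit))
print("kurtosis after rotation: ", kurtosis(rotated))
```

Each rotated coordinate is a signed sum of all input coordinates, so outlier energy is spread across the vector and the distribution moves toward a Gaussian shape. That is the property a fixed Lloyd-Max codebook relies on.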