TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
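
As a reference for the pieces named above, here is a minimal NumPy sketch of the quantize/dequantize path: FWHT rotation, norm recording, Lloyd-Max nearest-centroid lookup, and norm correction on the way back. The 8 levels are the classic Lloyd-Max values for a unit Gaussian; the actual trained codebooks, bit packing, and kernels differ, and all function names here are illustrative.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length a power of 2).
    Self-inverse under this scaling, so it serves as rotate and un-rotate."""
    x = np.asarray(x, dtype=np.float64).copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

# Classic Lloyd-Max levels for a unit Gaussian at 3 bits (trained codebooks differ).
CODEBOOK3 = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def turbo3_quantize(v):
    """Rotate, record the norm for later correction, snap to nearest centroid."""
    r = fwht(v)
    norm = np.linalg.norm(r)
    u = r / norm * np.sqrt(len(r))              # roughly unit-variance coordinates
    idx = np.abs(u[:, None] - CODEBOOK3).argmin(axis=1).astype(np.uint8)
    return idx, norm                            # 3-bit indices + one scale per vector

def turbo3_dequantize(idx, norm):
    """LUT lookup, undo the normalization, invert the rotation."""
    u = CODEBOOK3[idx]
    return fwht(u * norm / np.sqrt(len(u)))
```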

Created by @buun, 2026-03-27T17:28:26Z

Fields per entry: ID, title/hypothesis, result, confidence, reproductions, metrics.

cexp_55ab69: Promoting quality-sensitive layers to q8_0 improves PPL while maintaining compression
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl, prefill_pp4096, decode_tg64, compression_ratio

cexp_59b5fc: Training codebooks to lower MSE always improves perplexity
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_50iter, ppl_200iter

cexp_5c19ee: Per-channel RMS-based scaling before L2 norm + FWHT reduces turbo3 quantization error on head_dim=128, where channels are anisotropic
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl, ppl_q8_baseline, ppl_turbo3_no_innerq

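cexp_5c19ee layers a per-channel pre-scale on top of that pipeline (reusing turbo3_quantize / turbo3_dequantize from the sketch above); channel_rms is an assumed per-channel RMS statistic from a calibration pass, not a shipped artifact:

```python
# channel_rms: shape (head_dim,), measured on calibration data (assumption).
def quantize_prescaled(v, channel_rms):
    # Equalize anisotropic channels so none dominates the post-FWHT distribution.
    return turbo3_quantize(v / (channel_rms + 1e-6))

def dequantize_prescaled(idx, norm, channel_rms):
    # Undo the pre-scale after the usual LUT + norm + inverse-FWHT path.
    return turbo3_dequantize(idx, norm) * (channel_rms + 1e-6)
```
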
cexp_601201: MLX TurboQuant preserves needle-in-haystack retrieval
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: niah_2_5bit_score, niah_2_5bit_total, niah_3_5bit_score, niah_3_5bit_total

cexp_6237df: Alpha should vary logarithmically with context length to track the changing optimal operating point
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: alpha_3bit_2k, alpha_3bit_32k, alpha_2bit_2k, alpha_2bit_32k

cexp_697142: TCQ (512-state bitshift trellis with Viterbi encode, O(1) sliding-window decode) improves KV cache quality over scalar Lloyd-Max
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_turbo3_tcq, ppl_turbo2_tcq, ppl_turbo2_scalar, ppl_q8_baseline

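A toy version of cexp_697142's trellis, in the same NumPy register: a bitshift trellis with 2**L states (L=9 gives the 512 states above), k=3 code bits shifted in per element, Viterbi picking the min-MSE path, and a decoder whose state is just the last L code bits, which is what makes the sliding-window decode O(1) per element. The per-state lut is assumed to come from offline training, and this reference encoder is O(n * states * 2**k), nothing like the real kernel:

```python
import numpy as np

def tcq_encode(x, lut, L=9, k=3):
    """Viterbi encode x against a bitshift trellis; returns k-bit symbols."""
    S, mask = 2 ** L, 2 ** L - 1
    cost = np.full(S, np.inf)
    cost[0] = 0.0                            # pin the start state so decode begins at 0
    back = np.zeros((len(x), S), dtype=np.int32)
    for t, xt in enumerate(x):
        new_cost = np.full(S, np.inf)
        for prev in range(S):
            if cost[prev] == np.inf:
                continue
            base = (prev << k) & mask
            for b in range(2 ** k):          # shift in each candidate symbol
                s = base | b
                c = cost[prev] + (xt - lut[s]) ** 2
                if c < new_cost[s]:
                    new_cost[s], back[t, s] = c, prev
        cost = new_cost
    s = int(np.argmin(cost))
    symbols = []
    for t in range(len(x) - 1, -1, -1):      # backtrace the surviving path
        symbols.append(s & (2 ** k - 1))
        s = back[t, s]
    return symbols[::-1]

def tcq_decode(symbols, lut, L=9, k=3):
    """Sliding-window decode: the state is literally the last L code bits."""
    s, out = 0, []
    for b in symbols:
        s = ((s << k) | b) & (2 ** L - 1)
        out.append(lut[s])
    return np.array(out)
```
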
cexp_69a1fe: Reducing constant memory addresses from 8 to 4 via magnitude-only LUT with XOR sign recovery improves decode on pre-M5 Apple Silicon
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, speedup_vs_baseline

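cexp_69a1fe's trick relies on codebook symmetry: store only the four magnitudes and rebuild the sign by XORing into the float's IEEE sign bit. A sketch with an assumed {sign_bit, mag_idx} index layout (cexp_93a2c9 further down tests exactly such a remap):

```python
import numpy as np

# Four magnitudes instead of eight signed values: halves constant-memory addresses.
MAG_LUT = np.array([0.245, 0.756, 1.344, 2.152], dtype=np.float32)

def dequant3_mag_sign(idx3):
    """idx3 in 0..7, assumed layout {sign_bit, mag_idx}: value = +/- MAG_LUT[mag]."""
    idx3 = np.asarray(idx3, dtype=np.uint32)
    sign = (idx3 >> 2) & 1                      # top bit of the 3-bit index
    mag = idx3 & 3                              # low two bits address the LUT
    bits = MAG_LUT[mag].view(np.uint32) ^ (sign << 31)   # XOR sign recovery
    return bits.view(np.float32)
```
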
cexp_6e68ef: Inlining dequant into the flash attention loop eliminates function call overhead
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_6fa654: α = ||x||·dot(x,q)/||q||² halves per-element MSE vs the L2-preserving β = ||x||/||q||
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_l2_preserving, ppl_mse_optimal

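The least-squares identity behind cexp_6fa654: for reconstruction αq, the MSE-optimal scale is α = argmin_a ||x − aq||² = dot(x,q)/||q||², which matches the form above when q encodes the unit direction x/||x||; β instead preserves the L2 norm. A quick check with a crude stand-in quantizer (the rounding here is not the real codebook):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
q = np.round(x * 2) / 2                          # crude stand-in for the quantizer

beta = np.linalg.norm(x) / np.linalg.norm(q)     # L2-preserving scale
alpha = np.dot(x, q) / np.dot(q, q)              # least-squares-optimal scale

print(np.mean((x - beta * q) ** 2),    # always >= the alpha error...
      np.mean((x - alpha * q) ** 2))   # ...but the L2 norm is no longer preserved
```
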
cexp_765085: Different quantization types for K vs V can improve the quality/speed tradeoff
Result: inconclusive | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_turbo4k_q8v, ppl_q8k_turbo3v, ppl_turbo4k_turbo3v, ppl_turbo3k_turbo4v, decode_q8k_turbo3v, decode_turbo4k_q8v

cexp_799341: Sparse V benefits are not turbo3-specific; they also apply to q8_0
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_speedup, ppl_change, niah_change

cexp_80c0cd: Deferring the norm multiply to after the LUT lookup improves ILP
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_8410e2: Establish a dequant baseline with the standard 8-entry constant LUT
Result: baseline | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, ceiling_tok_s

cexp_8593ff: Reorder FWHT output by sequency to group similar-frequency components
Result: neutral | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_baseline, ppl_walsh

cexp_8aeaf9: Skipping V dequant for attention weights below 1e-6 has zero quality impact and improves decode speed
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_speedup_32k, ppl_sparse_v_on, ppl_sparse_v_off, niah_sparse_v, niah_sparse_v_total, niah_q8_0, niah_q8_0_total, skip_rate_512, skip_rate_4k, skip_rate_32k

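The rule in cexp_8aeaf9 (and cexp_bbb376 below), in isolation; v_dequant stands in for the expensive fused dequantization of one V row:

```python
import numpy as np

def attend_sparse_v(scores, v_dequant, head_dim, eps=1e-6):
    """One query's attention output, skipping V dequant where the softmax
    weight cannot matter: each skipped term moves the numerator by less
    than eps, which is why quality is unchanged."""
    w = np.exp(scores - scores.max())        # safe softmax numerator
    acc = np.zeros(head_dim)
    for i in np.nonzero(w >= eps)[0]:        # skip_rate = fraction filtered out
        acc += w[i] * v_dequant(i)           # dequantize only the rows that matter
    return acc / w.sum()                     # denominator stays exact
```
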
cexp_9180a2: Cross-lane register transfer via simd_shuffle avoids constant memory entirely
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_926406: Named register variables with ternary select avoid both the LUT and array indexing
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_92ae80: Clifford algebra rotor rotation is faster than dense matrix rotation
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: speedup_cuda_1k, speedup_cuda_16k, speedup_metal_1k, speedup_metal_65k, parameter_reduction

cexp_93a2c9: Remapping the turbo3 3-bit index to {mag_idx, sign_bit} halves register LUT pressure, improving decode
Result: neutral | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tg64_4k, decode_tg64_32k, decode_tg64_4k_baseline, decode_tg64_32k_baseline, ppl

cexp_954f7f: Replacing the fixed TBQ_CHUNK=4096 with a cudaMemGetInfo-based calculation enables better VRAM utilization
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_2k_mmap, ppl_2k_chunked, ppl_8k_mmap, ppl_8k_chunked, ppl_32k_mmap, ppl_32k_chunked

cexp_96a193: turbo3 achieves near-q8_0 quality on Apple Silicon with 4.6x compression
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: compression_ratio, prefill_tok_s, prefill_ratio_vs_q8, decode_tok_s_2k, decode_ratio_vs_q8_2k, decode_tok_s_8k, decode_ratio_vs_q8_8k, decode_tok_s_32k, decode_ratio_vs_q8_32k, perplexity, raw_kv_kurtosis, post_rotation_kurtosis, post_rotation_std, expected_std, std_ratio

cexp_991050: turbo3 enables full 128K context on a 24GB GPU where q8_0 OOMs
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: prefill_pp131072, decode_tg64_128k, vram_gb

cexp_9ceeb7: Processing queries in batches reduces the S buffer from O(nh_q*nq*chunk) to O(nh_q*q_batch*chunk)
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: s_buffer_27b_70k_before_gb, s_buffer_27b_70k_after_mb, ppl_32k

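Worked out for cexp_9ceeb7, with the head count, query count, batch size, and fp32 score dtype all assumed for illustration (the measured 27B/70K numbers live in the metrics above):

```python
nh_q, nq, chunk, bytes_f32 = 32, 70_000, 4096, 4   # all assumed, for scale only
before = nh_q * nq * chunk * bytes_f32             # whole query axis in flight
after = nh_q * 64 * chunk * bytes_f32              # q_batch = 64 queries at a time
print(f"{before / 2**30:.1f} GiB -> {after / 2**20:.1f} MiB")  # ~34.2 GiB -> 32.0 MiB
```
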
cexp_a2f5f5: KVLinC claims rotation hurts keys; test leaving K unrotated
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_both_rotated, ppl_k_unrotated_v_rotated, ppl_neither_rotated

cexp_ac6e32: Moving the backtrace from shared to global memory and double-buffering the cost arrays reduces syncthreads in the Viterbi encode
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_dense_baseline, decode_dense_opt, decode_moe_baseline, decode_moe_opt, ppl_baseline, ppl_optimized

cexp_afba59: Skipping entire KV tiles when all QK scores are far below the running max reduces V dequant work at long context
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_baseline_2k, decode_skip_2k, decode_baseline_65k, decode_skip_65k

cexp_b43ed6: Weighting codebook training by the query-norm distribution (Q²) improves downstream KLD by optimizing for attention-weighted distortion
Result: success | Confidence: 0.14 | Reproductions: 1/5

cexp_b7f864: Pure arithmetic (mul+add) to reconstruct centroids without any memory access
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_bbb376: Skip V dequant for negligible attention weights (exp(score-max) < 1e-6)
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tg64_8k_before, decode_tg64_8k_after, decode_tg64_32k, ppl

cexp_bd8d18: turbo3 quality generalizes to dense architectures on Apple Silicon
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: compression_ratio, perplexity

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Needs K-only quantization or a specialized correction.
buun via cuda-rtx3090

Perplexity logits buffer requires >37GB host RAM at 65K context. Need to confirm that adaptive chunking + Q-batching still match baseline PPL at this scale.
EXP-0002
context: 65536, chunks: 8, cache_type: tbq3
dusterbloom via adaptive-chunked-prefill

Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048, …], approach: [adaptive, …]
dusterbloom via adaptive-chunked-prefill

EXP-0009 showed the fused kernel's serial KV loop causes a 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequantized tiles) should close the gap to 2-5x by exploiting tensor-core parallelism while still avoiding full materialization of the dequantized KV.
EXP-0009
approach: tiled_fused_attention, tile_kv: [16, …], tile_q: [16, …], use_mma: true, bits: 3
dusterbloom via adaptive-chunked-prefill

Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies the inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per token.
EXP-0009
approach: native_tbq3_matmul, tile_m: 16, tile_n: 16, tile_k: 128, bits: 3
dusterbloom via adaptive-chunked-prefill

Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% additional memory savings at acceptable quality cost for tokens with negligible attention weight.
decay_threshold_positions: 16384, source: turbo3_tcq, target: turbo2_tcq
buun via cuda-rtx3090

Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control, enabling fine-grained quality/compression tradeoffs beyond the fixed layer-adaptive modes (a parser sketch follows below).
syntax: type:layer_range, separator: ","
buun via cuda-rtx3090

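A minimal parser for the proposed syntax; parse_cache_type_spec is a hypothetical helper, and erroring on uncovered layers is one possible policy the proposal leaves open:

```python
def parse_cache_type_spec(spec, n_layers):
    """Parse e.g. "turbo3_tcq:0-31,q8_0:32-39" into a per-layer type list."""
    types = [None] * n_layers
    for part in spec.split(","):
        qtype, _, rng = part.partition(":")       # "turbo3_tcq", "0-31"
        lo, _, hi = rng.partition("-")            # single index "5" also allowed
        for layer in range(int(lo), int(hi or lo) + 1):
            types[layer] = qtype
    if None in types:
        raise ValueError("layer ranges must cover all layers")
    return types

# parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", 40)
# -> ["turbo3_tcq"] * 32 + ["q8_0"] * 8
```
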
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context, adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800, cache_type: tbq2, approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill

There is a sweet spot between chunk sizes: smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768, chunk_sizes: [256, …], cache_type: tbq3
dusterbloom via adaptive-chunked-prefill

EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon, where there are no tensor cores and the butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention, backend: metal, head_dim: 128, bits: 3
dusterbloom via adaptive-chunked-prefill

EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared-memory usage and register pressure.
EXP-0006
head_dim: 256, model: gemma-3-12b, contexts: [2048, …], approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill