TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
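
Below is a minimal numpy sketch of the turbo3 encode/decode path named above (FWHT rotation, norm correction, Lloyd-Max codebook lookup). It is an illustration under assumptions: the 8-entry codebook values, the norm-correction placement, and the helper names are made up here and are not the shipped CUDA kernels.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis (power-of-two length)."""
    y = x.astype(np.float64)
    d = y.shape[-1]
    h = 1
    while h < d:
        blocks = y.reshape(*y.shape[:-1], d // (2 * h), 2, h)
        a, b = blocks[..., 0, :].copy(), blocks[..., 1, :].copy()
        blocks[..., 0, :], blocks[..., 1, :] = a + b, a - b
        h *= 2
    return y / np.sqrt(d)

# Hypothetical 8-entry (3-bit) codebook; the real Lloyd-Max centroids are fit to the
# post-rotation value distribution.
CODEBOOK = np.array([-0.35, -0.22, -0.13, -0.04, 0.04, 0.13, 0.22, 0.35])

def turbo3_encode(v):
    """Rotate, normalize, and map each channel to its nearest codebook centroid."""
    r = fwht(v)
    norm = np.linalg.norm(r) + 1e-12                 # stored alongside the 3-bit codes
    idx = np.abs(r[:, None] / norm - CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), norm

def turbo3_decode(idx, norm):
    """Look up centroids, rescale to the stored norm (norm correction), and un-rotate."""
    u = CODEBOOK[idx]
    u = u * (norm / (np.linalg.norm(u) + 1e-12))
    return fwht(u)                                    # orthonormal FWHT is self-inverse

v = np.random.randn(128) / np.sqrt(128)               # one head_dim=128 K or V vector
codes, norm = turbo3_encode(v)
v_hat = turbo3_decode(codes, norm)
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```

The rotation spreads outlier channels across the head dimension so a single small codebook fits the post-rotation distribution, and the stored norm restores the vector's overall scale after dequantization.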

Created by @buun, 2026-03-27T17:28:26Z

Showing 87 experiments

ID Title / Hypothesis Result Confidence Reproductions Metrics
cexp_bf9572
Gemma 4's K=V shared projections interact differently with KV cache quantization due to correlated K/V errors
inconclusive
0.14
1/5
kld_gemma4_q8, kld_qwen35_q8
cexp_c05db2
Processing 2 elements at once with half2 and ternary sign reduces address count
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_c08e7e
Threshold choice across 5 orders of magnitude does not affect quality
neutral
0.14
1/5
ppl_1e4, ppl_1e5, ppl_1e6, ppl_1e7, ppl_1e8
cexp_c4ea9d
Flipping computation to iterate over centroids and accumulate matching Q elements avoids per-element lookup
negative
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_c9ca34
Storing first N tokens at fp16 improves PPL (sink tokens get disproportionate attention)
neutral
0.14
1/5
ppl_no_sink, ppl_4_sinks, ppl_8_sinks, ppl_16_sinks
cexp_cfe4ca
KV cache operations are a negligible fraction of total decode compute at batch_size=1
success
0.14
1/5
cexp_d4375d
Establish baseline PPL values for TBQ types on Qwen3.5-9B
baseline
0.14
1/5
ppl_f16, ppl_q8_0, ppl_tbq4, ppl_tbq3, ppl_tbq2
cexp_d6b923
turbo3 maintains retrieval accuracy with multiple distractor needles through 32K context
success
0.14
1/5
retrieval_pct
cexp_dd82bb
Fully branchless FMA chain with zero memory access beats constant LUT
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_e22a96
Gemma-3 V cache broken because V un-rotation missing from iSWA build_attn overload
success
0.14
1/5
ppl_q8_baseline, ppl_turbo3_kv_after_fix, ppl_turbo3_k_only, ppl_turbo3_kv_before_fix
cexp_e426cb
Fusing activation quantization into the weight GEMM kernel eliminates separate q8_1 quantize kernel launch + L2 traffic
negative
0.14
1/5
decode_baseline, decode_f32_fma, decode_inreg_dp4a, decode_smem_dp4a
cexp_e5416b
turbo3 with FWHT rotation + norm correction matches q8_0 quality
success
0.14
1/5
ppl, ppl_q8_baseline, compression_ratio
cexp_e66662
QJL correction in turbo4 is unnecessary overhead
failure
0.14
1/5
ppl_with_qjl, ppl_without_qjl, prefill_without_qjl, prefill_with_qjl
cexp_e68bfa
Alpha applied at decode time (scaling dequantized V) vs encode time (baked into fp16 norm before quantization) may give different results, with decode-time enabling context-adaptive deployment
success
0.14
1/5
cexp_ec2202
Better byte extraction with 8-entry LUT improves over baseline
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_ee1dd9
Per-channel scaling before FWHT reduces head_dim=128 quality gap by aligning channel variances
negative
0.14
1/5
theoretical_gain
cexp_f0f034
Subtract per-element mean before FWHT to improve quantization
negative
0.14
1/5
ppl_turbo3_baseline, ppl_turbo3_dc, ppl_turbo4_baseline, ppl_turbo4_dc
cexp_f2b4aa
turbo3 KLD tracks bit rate, not implementation quality
neutral
0.14
1/5
kld_moe_turbo3, kld_moe_q4_0, kld_moe_q8_0, kld_dense_turbo3, kld_dense_q4_0, kld_dense_q8_0, same_top_p_moe_turbo3, same_top_p_dense_turbo3
cexp_f85c20
Capturing decode step as CUDA Graph eliminates per-kernel launch overhead (hundreds of launches per token)
success
0.14
1/5
decode_baseline, decode_cuda_graphs
cexp_3e9a2e
Asymmetric q8_0/tbq3 will maintain <5% prefill gap and decode parity across diverse model architectures
success
0.08
1/5
cexp_466df6
The fused kernel's end-to-end slowdown is amplified/masked by non-attention compute (FFN, norms, embeddings). By comparing f16 KV (zero dequant overhead) vs TBQ3 baseline vs TBQ3 fused, we can isolate the attention-only time and understand what fraction of total time is available for optimization.
success
0.08
1/5
cexp_59b1a9
Since the SRHT (H, diag(r)) is fixed across all tokens, we can pre-rotate Q with forward SRHT once, compute K scores directly against centroids (no K butterfly), accumulate V in compressed domain (no V butterfly), and apply ONE inverse SRHT at the end. This eliminates 14 butterfly stages per KV token from the inner loop, reducing per-token compute by ~8x.
success
0.08
1/5
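
A numpy sketch of the equivalence the cexp_59b1a9 hypothesis relies on: because the SRHT is orthonormal and fixed across tokens, Q can be forward-rotated once, scores taken directly against the stored (rotated) K, and the V sum accumulated in the rotated domain with one inverse rotation at the end. A dense orthonormal matrix R stands in for the H·diag(r) butterflies, and quantization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 128, 512                          # head_dim, cached KV tokens

# Dense orthonormal stand-in for the fixed SRHT rotation H·diag(r).
R, _ = np.linalg.qr(rng.standard_normal((D, D)))

q = rng.standard_normal(D)
K = rng.standard_normal((T, D))          # "true" keys/values before storage
V = rng.standard_normal((T, D))
K_rot, V_rot = K @ R.T, V @ R.T          # what the cache holds (rotation only, no quantization)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Reference: un-rotate every cached token (two butterflies per KV token).
a_ref = softmax((K_rot @ R) @ q / np.sqrt(D))
o_ref = a_ref @ (V_rot @ R)

# Compressed domain: rotate q once, score against stored K directly,
# accumulate V in the rotated domain, apply ONE inverse rotation at the end.
q_rot = R @ q
a = softmax(K_rot @ q_rot / np.sqrt(D))
o = (a @ V_rot) @ R

assert np.allclose(o, o_ref, atol=1e-6)
```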
cexp_6d957c
Using absolute query positions (nkv-nq+q_start+q_len) for causal skip correctly prunes Q batches without affecting PPL, saving 38-47% compute during full prefill
success
0.08
1/5
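
A small sketch of how the absolute-position arithmetic in cexp_6d957c can drive the causal skip; the helper name and the tile framing are illustrative, the offset expression is the one quoted above.

```python
def kv_tile_fully_masked(nkv, nq, q_start, q_len, kv_start):
    """True when a KV tile starting at kv_start lies in the causal future of every
    query in the sub-batch [q_start, q_start + q_len).

    With nkv cached positions and nq new queries, query row q sits at absolute
    position nkv - nq + q, so the newest query of the sub-batch is at
    nkv - nq + q_start + q_len - 1; any KV position past it is masked for the
    whole sub-batch, and the tile can be skipped without changing the result.
    """
    last_q_pos = nkv - nq + q_start + q_len - 1
    return kv_start > last_q_pos

# During a full prefill (nkv == nq) just under half of all (Q sub-batch, KV tile)
# pairs fall past the causal boundary, consistent with the reported 38-47% saving.
assert kv_tile_fully_masked(nkv=4096, nq=4096, q_start=0, q_len=256, kv_start=512)
assert not kv_tile_fully_masked(nkv=4096, nq=4096, q_start=512, q_len=256, kv_start=512)
```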
cexp_70847d
Adding head_dim=256 support to the fused TBQ3 attention kernel enables it to work with Qwen3.5-27B and Gemma-3 models. The kernel needs per-block SRHT (two independent 128-element butterflies for D=256) and FLOATS_PER_LANE=D/WARP_SIZE generalization.
success
0.08
1/5
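
A small numpy illustration of the per-block SRHT structure cexp_70847d describes for head_dim=256: two independent 128-element butterflies, one per half of the head dimension. The diag(r) sign flips are omitted for brevity, so this only demonstrates the block layout, not the full rotation or the kernel.

```python
import numpy as np

def fwht128(x):
    """128-point orthonormal fast Walsh-Hadamard transform of a 1-D block."""
    y, h = x.astype(np.float64), 1
    while h < 128:
        y = y.reshape(-1, 2, h)
        y = np.concatenate([y[:, 0] + y[:, 1], y[:, 0] - y[:, 1]], axis=1).reshape(-1)
        h *= 2
    return y / np.sqrt(128)

def rotate_d256(v):
    """Per-block rotation for head_dim=256: two independent 128-element butterflies."""
    return np.concatenate([fwht128(v[:128]), fwht128(v[128:])])

v = np.random.randn(256)
r = rotate_d256(v)
assert np.isclose(np.linalg.norm(r), np.linalg.norm(v))   # each block is orthonormal
assert np.allclose(rotate_d256(r), v)                     # and self-inverse
```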
cexp_c69dc2
Fusing TBQ3 dequant (inverse SRHT) directly into a FlashAttention-style online softmax kernel eliminates all intermediate buffers (k_tmp, v_tmp, S) while producing identical results
success
0.08
1/5
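
A numpy sketch of the online-softmax recurrence that a FlashAttention-style fused kernel like cexp_c69dc2 builds on: each KV tile is dequantized only as its tile is processed and folded into a running (max, sum, output) state, so no full score matrix or dequantized K/V buffer is ever materialized. The dequant callback is an identity stand-in for TBQ3 dequant + inverse SRHT.

```python
import numpy as np

def online_softmax_attention(q, kv_tiles, dequant, scale):
    """Single-pass attention: fold each dequantized KV tile into a running
    (max, denominator, output) state instead of materializing the score row."""
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    o = np.zeros_like(q)             # running (unnormalized) output
    for k_packed, v_packed in kv_tiles:
        K, V = dequant(k_packed), dequant(v_packed)   # tile-sized dequant only
        s = K @ q * scale
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)                   # correct previous state for the new max
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        o = o * rescale + p @ V
        m = m_new
    return o / l

rng = np.random.default_rng(1)
D, T, B = 64, 256, 32
q = rng.standard_normal(D)
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))
tiles = [(K[i:i + B], V[i:i + B]) for i in range(0, T, B)]   # stand-ins for packed TBQ3 tiles

o = online_softmax_attention(q, tiles, dequant=lambda t: t, scale=1.0 / np.sqrt(D))

s_ref = K @ q / np.sqrt(D)
a_ref = np.exp(s_ref - s_ref.max())
a_ref /= a_ref.sum()
assert np.allclose(o, a_ref @ V, atol=1e-6)
```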
cexp_e7c22d
Bulk-dequanting V to fp16 before MMA launch (instead of fusing V dequant into the tile loader) will close the pp8192 gap because V can use the standard cp.async.cg pipeline
success
0.08
1/5
pp8192_before, pp8192_after
cexp_e7e920
The fused kernel's Q addressing was swapped — it used q_head * (nq * D) instead of q_idx * (nh_q * D). After ggml_permute(0,2,1,3), Q physical layout is [nq, nh_q, D] (token-major) not [nh_q, nq, D] (head-major). Fixing this should make PPL match baseline.
success
0.08
1/5
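
A numpy sketch of the addressing fix in cexp_e7e920; the shapes are illustrative, but the two offset expressions are the ones quoted in the hypothesis, evaluated against the token-major [nq, nh_q, D] layout produced by ggml_permute(0,2,1,3).

```python
import numpy as np

nq, nh_q, D = 4, 8, 128
# After ggml_permute(0,2,1,3) the Q tensor is token-major: [nq, nh_q, D].
Q = np.arange(nq * nh_q * D, dtype=np.float32).reshape(nq, nh_q, D)
flat = Q.reshape(-1)
q_idx, q_head = 2, 5

buggy_off = q_head * (nq * D) + q_idx * D     # head-major assumption: wrong for this layout
fixed_off = q_idx * (nh_q * D) + q_head * D   # token-major addressing: matches the layout

assert np.array_equal(flat[fixed_off:fixed_off + D], Q[q_idx, q_head])
assert not np.array_equal(flat[buggy_off:buggy_off + D], Q[q_idx, q_head])
```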

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Need K-only quantization or a specialized correction
buun via cuda-rtx3090
The perplexity logits buffer requires >37GB of host RAM at 65K context. Need to confirm that adaptive chunking + Q-batching still match baseline PPL at this scale.
EXP-0002
context: 65536 chunks: 8 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048 approach: [adaptive
dusterbloom via adaptive-chunked-prefill
EXP-0009 showed the fused kernel's serial KV loop causes 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequanted tiles) should close the gap to 2-5x by exploiting tensor core parallelism while still avoiding full materialization of dequanted KV.
EXP-0009
approach: tiled_fused_attention tile_kv: [16 tile_q: [16 use_mma: true bits: 3
dusterbloom via adaptive-chunked-prefill
Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per-token.
EXP-0009
approach: native_tbq3_matmul tile_m: 16 tile_n: 16 tile_k: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% extra memory savings at acceptable quality cost for tokens with negligible attention weight
decay_threshold_positions: 16384 source: turbo3_tcq target: turbo2_tcq
buun via cuda-rtx3090
Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control — enables fine-grained quality/compression tradeoffs beyond fixed layer-adaptive modes
syntax: type:layer_range separator: ,
buun via cuda-rtx3090
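
A sketch of a parser for the proposed per-layer --cache-type-k syntax above, assuming comma-separated type:layer_range entries with inclusive ranges as in the example; the error handling (none), the q8_0 default for unspecified layers, and the single-layer shorthand are illustrative choices, not an existing flag.

```python
def parse_cache_type_spec(spec, n_layers, default="q8_0"):
    """Expand e.g. "turbo3_tcq:0-31,q8_0:32-39" into a per-layer cache-type list."""
    types = [default] * n_layers
    for entry in spec.split(","):
        type_name, _, layer_range = entry.partition(":")
        lo, _, hi = layer_range.partition("-")
        for layer in range(int(lo), int(hi or lo) + 1):
            types[layer] = type_name
    return types

assert parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", 40) == \
    ["turbo3_tcq"] * 32 + ["q8_0"] * 8
```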
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800 cache_type: tbq2 approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill
There is a sweet spot between chunk sizes — smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768 chunk_sizes: [256 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon where there are no tensor cores and butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention backend: metal head_dim: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared memory usage and register pressure.
EXP-0006
head_dim: 256 model: gemma-3-12b contexts: [2048 approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill