TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches: 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
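For orientation, here is a minimal NumPy sketch of the pipeline described above (FWHT rotation, Lloyd-Max codebook lookup, per-vector norm correction). The 8-level codebook below is the standard Lloyd-Max table for a unit Gaussian, not necessarily the codebook this repo ships, and all names are illustrative.

```python
# Minimal sketch of a turbo3-style encode/decode path: orthonormal FWHT
# rotation, per-vector scaling, nearest-centroid Lloyd-Max lookup, then a
# norm correction on dequantization to undo systematic shrinkage.
import numpy as np

# Approximate 8-level Lloyd-Max centroids for a unit Gaussian (illustrative,
# not necessarily the codebook used by turbo3).
CODEBOOK_3BIT = np.array([-2.15, -1.34, -0.76, -0.25, 0.25, 0.76, 1.34, 2.15])

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal, so fwht is its own inverse

def quantize(v: np.ndarray):
    r = fwht(v)                                   # spread outliers across channels
    scale = r.std() + 1e-12                       # one scale per vector
    idx = np.abs(r[:, None] / scale - CODEBOOK_3BIT[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale, np.linalg.norm(v)

def dequantize(idx, scale, norm):
    q = CODEBOOK_3BIT[idx] * scale
    q *= norm / (np.linalg.norm(q) + 1e-12)       # norm correction
    return fwht(q)                                # inverse rotation

v = np.random.randn(128)
v_hat = dequantize(*quantize(v))
print("cosine similarity:", float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))))
```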

Created by @buun on 2026-03-27T17:28:26Z
Experiments: 96 · Forks: 3 · Resources: 36 · Benchmarks: 2 · Broadcasts: 3

Showing 87 experiments

ID · Title / Hypothesis · Result · Confidence · Reproductions · Metrics
cexp_015c90
turbo4 K produces garbage because Q pre-rotation guard only checks TURBO3_0, not TURBO4_0
success
0.14
1/5
ppl_turbo4_kv_qwen35_27b_fixed, ppl_turbo4_kv_qwen3_14b_fixed, ppl_turbo4_kv_qwen3_14b_broken, ppl_turbo4v_qwen3_14b
cexp_01f12d
turbo3 at 3.5 bpv competes with upstream rotated q4_0 at 4.5 bpv
success
0.14
1/5
ppl_f16_baseline, ppl_turbo3_2k, ppl_q4_0_rot_2k
cexp_028768
Forcing non-vectorized FA kernel (nl=2) improves single-token decode
failure
0.14
1/5
decode_tok_s_8k, vs_ceiling_pct
cexp_09836a
Reading turbo3 directly in VEC attention kernel (scalar dequant, no fp16 buffer) saves 5x bandwidth by avoiding fp16 materialization
negative
0.14
1/5
decode_fp16_mma, decode_native_vec
cexp_0b0172
turbo dequant-to-fp16 kernels ignore stream dimension ne[3], causing catastrophic PPL with n_seq > 1
success
0.14
1/5
ppl_nseq1_before, ppl_nseq1_after, ppl_nseq2_before, ppl_nseq2_after, ppl_nseq4_before, ppl_nseq4_after, ppl_2k_8c_la1, ppl_2k_8c_q8, ppl_4k_4c_la1, ppl_4k_4c_q8, ppl_8k_4c_la1, ppl_8k_4c_q8, ppl_8k_4c_uniform
cexp_10bd78
Accepting ~1% PPL regression from fp16 round-trip enables 1.9x prefill speedup for turbo4
success
0.14
1/5
prefill_pp4096_before, prefill_pp4096_after, ppl_full_precision, ppl_fp16_prefill, decode_tg64
cexp_117f61
Auto-detect (max_scale_ratio < 1.2 → disable) prevents InnerQ from hurting well-balanced head_dim=256 distributions (heuristic sketched after the experiments table)
success
0.14
1/5
ppl, ppl_turbo3_baseline, ppl_innerq_forced, max_scale_ratio_detected
cexp_1828c2
Codebooks worse at short context become better at long context due to CLT averaging in attention
success
0.14
1/5
ppl_compiled_2k, ppl_finetuned_2k, ppl_compiled_32k, ppl_finetuned_32k
cexp_1c3f17
Loading 2KB TCQ codebook from __constant__ memory into shared memory gives 32-bank parallel access, improving decode throughput
negative
0.14
1/5
decode_constant_2k, decode_smem_2k, decode_constant_32k, decode_smem_32k
cexp_1ff7e8
Chunked cuBLAS GEMM for Q*K and score*V during prefill outperforms fused flash attention MMA
negative
0.14
1/5
prefill_fused_mma
cexp_22b2b2
Greedy trellis encode (locally optimal at each step) trades quality for prefill speed
negative
0.14
1/5
ppl_greedy, ppl_multi_start_512, ppl_viterbi
cexp_239fdb
TCQ trellis structure induces autocorrelation in quantization errors that could be exploited or must be modeled
negative
0.14
1/5
lag1_autocorrelation, lag2_autocorrelation, lag4_autocorrelation
cexp_283634
2-bit PolarQuant (4-centroid Lloyd-Max) provides maximum compression for VRAM-constrained scenarios
inconclusive
0.14
1/5
ppl_turbo2_uniform, ppl_turbo3k_turbo2v, ppl_turbo2k_turbo3v, ppl_turbo2k_q8v, ppl_turbo2_la1, ppl_turbo2_la2, kv_memory_4k_mib, compression_vs_fp16, decode_tg128
cexp_28ca71
turbo3 on draft model KV saves VRAM and maintains acceptance rate
neutral
0.14
1/5
throughput_q8_draft, throughput_turbo3_draft, n_drafted_q8, n_drafted_turbo3, normal_decode
cexp_2a99b4
CUDA implementation on Blackwell achieves near-parity with F16 decode
success
0.14
1/5
decode_tok_s_f16, decode_tok_s_q8_0, decode_tok_s_turbo3, decode_ratio_vs_f16, compression_ratio
cexp_2db704
PPL-optimal alpha may not be KLD-optimal — the two metrics may disagree on the best operating point
success
0.14
1/5
ppl_optimal_alpha, kld_optimal_alpha
cexp_2fe356
Bulk dequant turbo KV to fp16 then use MMA tensor core kernel for prefill
success
0.14
1/5
prefill_pp4096_before, prefill_pp4096_after, prefill_ratio_vs_q8, decode_tg64, ppl
cexp_356f07
Scaling V norm by constant alpha after quantization compensates for systematic shrinkage and improves quality
success
0.14
1/5
cexp_3683be
ncu profiling reveals whether MMVQ weight GEMM kernel has optimization headroom
success
0.14
1/5
registers_per_thread, active_warps
cexp_39282e
turbo3 quality generalizes across architectures
negative
0.14
1/5
cexp_392e0e
Promoting last 8 layers to q8_0 improves PPL while maintaining most turbo3 compression
success
0.14
1/5
ppl_la2_turbo3, ppl_la2_turbo4, decode_tg64_la2_turbo3, decode_tg64_la2_turbo4, compression_ratio
cexp_3c4375
Multi-GPU performance matches single-GPU ratios
inconclusive
0.14
1/5
prefill_turbo3, prefill_q8, decode_turbo3, decode_q8, prefill_turbo4, decode_turbo4
cexp_3e59be
turbo4 (3-bit + QJL correction) beats q8_0 on head_dim=256
success
0.14
1/5
ppl, ppl_q8_baseline, compression_ratio
cexp_3eeb92
Madreag CUDA fork maintains decode performance at long context
negative
0.14
1/5
decode_tg128_0k_q8, decode_tg128_0k_turbo3, decode_ratio_0k, decode_tg128_4k_q8, decode_tg128_4k_turbo3, decode_ratio_4k, decode_tg128_8k_q8, decode_tg128_8k_turbo3, decode_ratio_8k, decode_tg128_16k_q8, decode_tg128_16k_turbo3, decode_ratio_16k, decode_tg128_204k_q8, decode_tg128_204k_turbo3, decode_ratio_204k, prefill_ratio_0k, prefill_ratio_16k, prefill_ratio_204k, kv_mem_q8_total_mib, kv_mem_turbo3_total_mib, kv_compression_ratio
cexp_42fdf4
Measure theoretical maximum decode speed with dequant disabled (returns zeros)
baseline
0.14
1/5
decode_tok_s_8k, decode_ratio_vs_q8
cexp_48061c
TurboQuant generalizes to vLLM inference framework on multi-GPU CUDA
success
0.14
1/5
kv_savings_pct, niah_score, niah_total, prefill_tok_s_min, prefill_tok_s_max, decode_tok_s_short, decode_tok_s_131k, kv_cache_mb_tq_131k, kv_cache_mb_baseline_131k, full_attn_layers_compression, cosine_sim_3bit_keys, cosine_sim_2bit_values, cosine_sim_4bit_values
cexp_4d9a59
Replacing all LUT reads with a select/ternary chain eliminates constant memory entirely
failure
0.14
1/5
decode_tok_s_8k, vs_ceiling_pct
cexp_4f2e9a
MSE-only approach (no QJL) at 2-bit produces character-identical output to fp16
success
0.14
1/5
cexp_53ebee
Tuning split-K parallel_blocks parameter improves turbo decode throughput
neutral
0.14
1/5
decode_pb_auto, decode_pb1, decode_pb2, decode_pb4, decode_pb8, decode_pb16, decode_pb32, decode_q8_baseline
cexp_558c28
FWHT rotation is essential for turbo3 quality
conflict
0.14
1/5
ppl_with_rotation, ppl_without_rotation_with_norm, ppl_without_rotation_without_norm, ppl_q8_baseline, ppl_la3_last4, ppl_la4_first4, ppl_la5_first2_last2, compression_ratio_4layers, ppl_mode6_vonly_last8, ppl_mode7_konly_last8, ppl_mode8_vonly_2plus2, ppl_mode2_both_last8
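As an aside on the auto-detect rule in cexp_117f61 above, one plausible reading of the heuristic is sketched below. Only the max_scale_ratio name and the 1.2 threshold come from the hypothesis text; the grouping scheme and the function itself are assumptions for illustration.

```python
# Plausible reading of the InnerQ auto-detect heuristic: group the channel
# dimension, take a max-abs scale per group, and skip InnerQ when the groups
# are already well balanced (max_scale_ratio below the 1.2 threshold).
import numpy as np

INNERQ_SCALE_RATIO_THRESHOLD = 1.2  # threshold quoted in the hypothesis

def should_use_innerq(head_vectors: np.ndarray, group_size: int = 32):
    """head_vectors: (n_tokens, head_dim) activations for one head (layout assumed)."""
    x = np.asarray(head_vectors)
    n_groups = x.shape[1] // group_size
    groups = x[:, :n_groups * group_size].reshape(x.shape[0], n_groups, group_size)
    scales = np.abs(groups).max(axis=(0, 2))              # one scale per channel group
    max_scale_ratio = scales.max() / (scales.min() + 1e-12)
    return max_scale_ratio >= INNERQ_SCALE_RATIO_THRESHOLD, max_scale_ratio

use_it, ratio = should_use_innerq(np.random.randn(1024, 256))
print(f"max_scale_ratio={ratio:.3f} -> InnerQ {'enabled' if use_it else 'disabled'}")
```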

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Need K-only quantization or a specialized correction
buun via cuda-rtx3090
Perplexity logits buffer requires >37GB host RAM at 65K context. Need to confirm adaptive chunking + Q-batching hold PPL-match at this scale.
EXP-0002
context: 65536 chunks: 8 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048 approach: [adaptive
dusterbloom via adaptive-chunked-prefill
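A rough sketch of what "choosing the largest viable chunk size" could look like; the per-token cost constant and the candidate list are placeholders, not the fork's actual accounting.

```python
# Illustrative adaptive chunk sizing: pick the largest candidate chunk whose
# temporary prefill buffers fit the currently free VRAM. The per-token cost
# constant and candidate list are placeholders, not the fork's accounting.
BYTES_PER_PREFILL_TOKEN = 4 * 1024 * 1024
CHUNK_CANDIDATES = [8192, 4096, 2048, 1024, 512, 256]

def pick_chunk_size(free_vram_bytes: int, tokens_remaining: int, headroom: float = 0.8) -> int:
    """Largest candidate that fits the VRAM budget without overshooting the prompt."""
    budget = int(free_vram_bytes * headroom)
    for chunk in CHUNK_CANDIDATES:
        if chunk <= tokens_remaining and chunk * BYTES_PER_PREFILL_TOKEN <= budget:
            return chunk
    return CHUNK_CANDIDATES[-1]  # fall back to the smallest chunk

def chunk_schedule(n_prompt_tokens: int, free_vram_bytes: int) -> list[int]:
    """Chunk sizes a prefill loop would use for a prompt of n_prompt_tokens."""
    schedule, pos = [], 0
    while pos < n_prompt_tokens:
        chunk = pick_chunk_size(free_vram_bytes, n_prompt_tokens - pos)
        schedule.append(chunk)
        pos += chunk
    return schedule

# Under this placeholder cost model, a 65536-token prompt with plenty of free
# VRAM gets 8192-token chunks, i.e. larger than the fixed 4096 baseline.
print(chunk_schedule(65536, 64 * 1024**3))
```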
EXP-0009 showed the fused kernel's serial KV loop causes 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequanted tiles) should close the gap to 2-5x by exploiting tensor core parallelism while still avoiding full materialization of dequanted KV.
EXP-0009
approach: tiled_fused_attention tile_kv: [16 tile_q: [16 use_mma: true bits: 3
dusterbloom via adaptive-chunked-prefill
Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per-token.
EXP-0009
approach: native_tbq3_matmul tile_m: 16 tile_n: 16 tile_k: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% extra memory savings at acceptable quality cost for tokens with negligible attention weight
decay_threshold_positions: 16384 source: turbo3_tcq target: turbo2_tcq
buun via cuda-rtx3090
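Policy-level sketch of the decay idea, assuming a block-structured cache; only the 16384-position threshold and the source/target format names come from the proposal, the block layout and requantization call are hypothetical stand-ins.

```python
# Policy-level sketch of decaying old KV blocks from turbo3_tcq to turbo2_tcq.
# The KVBlock layout is a hypothetical stand-in; only the 16384-position
# threshold and the format names come from the proposal.
from dataclasses import dataclass

DECAY_THRESHOLD_POSITIONS = 16384
SOURCE_FMT, TARGET_FMT = "turbo3_tcq", "turbo2_tcq"

@dataclass
class KVBlock:
    first_pos: int      # position of the oldest token in the block
    last_pos: int       # position of the newest token in the block
    fmt: str            # current cache quantization format

def decay_old_blocks(blocks: list[KVBlock], current_pos: int) -> int:
    """Requantize blocks whose newest token is older than the decay threshold."""
    n_requantized = 0
    for blk in blocks:
        age = current_pos - blk.last_pos
        if blk.fmt == SOURCE_FMT and age > DECAY_THRESHOLD_POSITIONS:
            blk.fmt = TARGET_FMT   # real code would rewrite the packed payload here
            n_requantized += 1
    return n_requantized

cache = [KVBlock(i, i + 255, SOURCE_FMT) for i in range(0, 32768, 256)]
print(decay_old_blocks(cache, current_pos=32768), "blocks decayed to", TARGET_FMT)
```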
Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control — enables fine-grained quality/compression tradeoffs beyond fixed layer-adaptive modes
syntax: type:layer_range separator: ,
buun via cuda-rtx3090
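One way the proposed spec string could be parsed; the type:layer_range, comma-separated format is from the proposal, while the parser itself is only an illustration, not the fork's actual argument handling.

```python
# Illustrative parser for the proposed per-layer cache-type syntax, e.g.
# --cache-type-k "turbo3_tcq:0-31,q8_0:32-39". Only the spec format comes
# from the proposal; this function is not the fork's argument parser.
def parse_cache_type_spec(spec: str, n_layers: int, default: str = "f16") -> list[str]:
    per_layer = [default] * n_layers
    for entry in spec.split(","):
        ktype, layer_range = entry.split(":")
        start, _, end = layer_range.partition("-")
        lo, hi = int(start), int(end or start)   # "q8_0:7" selects a single layer
        if not (0 <= lo <= hi < n_layers):
            raise ValueError(f"layer range {layer_range!r} out of bounds for {n_layers} layers")
        for layer in range(lo, hi + 1):
            per_layer[layer] = ktype
    return per_layer

types = parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", n_layers=40)
print(types[0], types[31], types[32], types[39])   # turbo3_tcq turbo3_tcq q8_0 q8_0
```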
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800 cache_type: tbq2 approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill
There is a sweet spot between chunk sizes — smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768 chunk_sizes: [256 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon where there are no tensor cores and butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention backend: metal head_dim: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared memory usage and register pressure.
EXP-0006
head_dim: 256 model: gemma-3-12b contexts: [2048 approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill