TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches: 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
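For orientation, here is a minimal NumPy sketch of the pipeline described above (FWHT rotation, Lloyd-Max codebook lookup, per-vector norm correction). The 8-level codebook below is the standard Lloyd-Max table for a unit Gaussian, not necessarily the codebook this repo ships, and all names are illustrative.

```python
# Minimal sketch of a turbo3-style encode/decode path: orthonormal FWHT
# rotation, per-vector scaling, nearest-centroid Lloyd-Max lookup, then a
# norm correction on dequantization to undo systematic shrinkage.
import numpy as np

# Approximate 8-level Lloyd-Max centroids for a unit Gaussian (illustrative,
# not necessarily the codebook used by turbo3).
CODEBOOK_3BIT = np.array([-2.15, -1.34, -0.76, -0.25, 0.25, 0.76, 1.34, 2.15])

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal, so fwht is its own inverse

def quantize(v: np.ndarray):
    r = fwht(v)                                   # spread outliers across channels
    scale = r.std() + 1e-12                       # one scale per vector
    idx = np.abs(r[:, None] / scale - CODEBOOK_3BIT[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale, np.linalg.norm(v)

def dequantize(idx, scale, norm):
    q = CODEBOOK_3BIT[idx] * scale
    q *= norm / (np.linalg.norm(q) + 1e-12)       # norm correction
    return fwht(q)                                # inverse rotation

v = np.random.randn(128)
v_hat = dequantize(*quantize(v))
print("cosine similarity:", float(v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))))
```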

Created by @buun on 2026-03-27T17:28:26Z
Experiments: 96 · Forks: 3 · Resources: 36 · Benchmarks: 2 · Broadcasts: 3

Showing 87 experiments

ID · Title / Hypothesis · Result · Confidence · Reproductions · Metrics
cexp_015c90
turbo4 K produces garbage because Q pre-rotation guard only checks TURBO3_0, not TURBO4_0
success
0.14
1/5
ppl_turbo4_kv_qwen35_27b_fixed, ppl_turbo4_kv_qwen3_14b_fixed, ppl_turbo4_kv_qwen3_14b_broken, ppl_turbo4v_qwen3_14b
cexp_01f12d
turbo3 at 3.5 bpv competes with upstream rotated q4_0 at 4.5 bpv
success
0.14
1/5
ppl_f16_baseline, ppl_turbo3_2k, ppl_q4_0_rot_2k
cexp_028768
Forcing non-vectorized FA kernel (nl=2) improves single-token decode
failure
0.14
1/5
decode_tok_s_8k, vs_ceiling_pct
cexp_09836a
Reading turbo3 directly in VEC attention kernel (scalar dequant, no fp16 buffer) saves 5x bandwidth by avoiding fp16 materialization
negative
0.14
1/5
decode_fp16_mma, decode_native_vec
cexp_0b0172
turbo dequant-to-fp16 kernels ignore stream dimension ne[3], causing catastrophic PPL with n_seq > 1
success
0.14
1/5
ppl_nseq1_before, ppl_nseq1_after, ppl_nseq2_before, ppl_nseq2_after, ppl_nseq4_before, ppl_nseq4_after, ppl_2k_8c_la1, ppl_2k_8c_q8, ppl_4k_4c_la1, ppl_4k_4c_q8, ppl_8k_4c_la1, ppl_8k_4c_q8, ppl_8k_4c_uniform
cexp_10bd78
Accepting ~1% PPL regression from fp16 round-trip enables 1.9x prefill speedup for turbo4
success
0.14
1/5
prefill_pp4096_before, prefill_pp4096_after, ppl_full_precision, ppl_fp16_prefill, decode_tg64
cexp_117f61
Auto-detect (max_scale_ratio < 1.2 → disable) prevents InnerQ from hurting well-balanced head_dim=256 distributions (heuristic sketched after the experiments table)
success
0.14
1/5
ppl, ppl_turbo3_baseline, ppl_innerq_forced, max_scale_ratio_detected
cexp_1828c2
Codebooks worse at short context become better at long context due to CLT averaging in attention
success
0.14
1/5
ppl_compiled_2k, ppl_finetuned_2k, ppl_compiled_32k, ppl_finetuned_32k
cexp_1c3f17
Loading 2KB TCQ codebook from __constant__ memory into shared memory gives 32-bank parallel access, improving decode throughput
negative
0.14
1/5
decode_constant_2k, decode_smem_2k, decode_constant_32k, decode_smem_32k
cexp_1ff7e8
Chunked cuBLAS GEMM for Q*K and score*V during prefill outperforms fused flash attention MMA
negative
0.14
1/5
prefill_fused_mma
cexp_22b2b2
Greedy trellis encode (locally optimal at each step) trades quality for prefill speed
negative
0.14
1/5
ppl_greedy, ppl_multi_start_512, ppl_viterbi
cexp_239fdb
TCQ trellis structure induces autocorrelation in quantization errors that could be exploited or must be modeled
negative
0.14
1/5
lag1_autocorrelation, lag2_autocorrelation, lag4_autocorrelation
cexp_283634
2-bit PolarQuant (4-centroid Lloyd-Max) provides maximum compression for VRAM-constrained scenarios
inconclusive
0.14
1/5
ppl_turbo2_uniform, ppl_turbo3k_turbo2v, ppl_turbo2k_turbo3v, ppl_turbo2k_q8v, ppl_turbo2_la1, ppl_turbo2_la2, kv_memory_4k_mib, compression_vs_fp16, decode_tg128
cexp_28ca71
turbo3 on draft model KV saves VRAM and maintains acceptance rate
neutral
0.14
1/5
throughput_q8_draft, throughput_turbo3_draft, n_drafted_q8, n_drafted_turbo3, normal_decode
cexp_2a99b4
CUDA implementation on Blackwell achieves near-parity with F16 decode
success
0.14
1/5
decode_tok_s_f16, decode_tok_s_q8_0, decode_tok_s_turbo3, decode_ratio_vs_f16, compression_ratio
cexp_2db704
PPL-optimal alpha may not be KLD-optimal — the two metrics may disagree on the best operating point
success
0.14
1/5
ppl_optimal_alpha, kld_optimal_alpha
cexp_2fe356
Bulk dequant turbo KV to fp16 then use MMA tensor core kernel for prefill
success
0.14
1/5
prefill_pp4096_before, prefill_pp4096_after, prefill_ratio_vs_q8, decode_tg64, ppl
cexp_356f07
Scaling V norm by constant alpha after quantization compensates for systematic shrinkage and improves quality
success
0.14
1/5
cexp_3683be
ncu profiling reveals whether MMVQ weight GEMM kernel has optimization headroom
success
0.14
1/5
registers_per_thread, active_warps
cexp_39282e
turbo3 quality generalizes across architectures
negative
0.14
1/5
cexp_392e0e
Promoting last 8 layers to q8_0 improves PPL while maintaining most turbo3 compression
success
0.14
1/5
ppl_la2_turbo3, ppl_la2_turbo4, decode_tg64_la2_turbo3, decode_tg64_la2_turbo4, compression_ratio
cexp_3c4375
Multi-GPU performance matches single-GPU ratios
inconclusive
0.14
1/5
prefill_turbo3, prefill_q8, decode_turbo3, decode_q8, prefill_turbo4, decode_turbo4
cexp_3e59be
turbo4 (3-bit + QJL correction) beats q8_0 on head_dim=256
success
0.14
1/5
ppl, ppl_q8_baseline, compression_ratio
cexp_3eeb92
Madreag CUDA fork maintains decode performance at long context
negative
0.14
1/5
decode_tg128_0k_q8, decode_tg128_0k_turbo3, decode_ratio_0k, decode_tg128_4k_q8, decode_tg128_4k_turbo3, decode_ratio_4k, decode_tg128_8k_q8, decode_tg128_8k_turbo3, decode_ratio_8k, decode_tg128_16k_q8, decode_tg128_16k_turbo3, decode_ratio_16k, decode_tg128_204k_q8, decode_tg128_204k_turbo3, decode_ratio_204k, prefill_ratio_0k, prefill_ratio_16k, prefill_ratio_204k, kv_mem_q8_total_mib, kv_mem_turbo3_total_mib, kv_compression_ratio
cexp_42fdf4
Measure theoretical maximum decode speed with dequant disabled (returns zeros)
baseline
0.14
1/5
decode_tok_s_8k, decode_ratio_vs_q8
cexp_48061c
TurboQuant generalizes to vLLM inference framework on multi-GPU CUDA
success
0.14
1/5
kv_savings_pct, niah_score, niah_total, prefill_tok_s_min, prefill_tok_s_max, decode_tok_s_short, decode_tok_s_131k, kv_cache_mb_tq_131k, kv_cache_mb_baseline_131k, full_attn_layers_compression, cosine_sim_3bit_keys, cosine_sim_2bit_values, cosine_sim_4bit_values
cexp_4d9a59
Replacing all LUT reads with a select/ternary chain eliminates constant memory entirely
failure
0.14
1/5
decode_tok_s_8k, vs_ceiling_pct
cexp_4f2e9a
MSE-only approach (no QJL) at 2-bit produces character-identical output to fp16
success
0.14
1/5
cexp_53ebee
Tuning split-K parallel_blocks parameter improves turbo decode throughput
neutral
0.14
1/5
decode_pb_auto, decode_pb1, decode_pb2, decode_pb4, decode_pb8, decode_pb16, decode_pb32, decode_q8_baseline
cexp_558c28
FWHT rotation is essential for turbo3 quality
conflict
0.14
1/5
ppl_with_rotation, ppl_without_rotation_with_norm, ppl_without_rotation_without_norm, ppl_q8_baseline, ppl_la3_last4, ppl_la4_first4, ppl_la5_first2_last2, compression_ratio_4layers, ppl_mode6_vonly_last8, ppl_mode7_konly_last8, ppl_mode8_vonly_2plus2, ppl_mode2_both_last8
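As an aside on the auto-detect rule in cexp_117f61 above, one plausible reading of the heuristic is sketched below. Only the max_scale_ratio name and the 1.2 threshold come from the hypothesis text; the grouping scheme and the function itself are assumptions for illustration.

```python
# Plausible reading of the InnerQ auto-detect heuristic: group the channel
# dimension, take a max-abs scale per group, and skip InnerQ when the groups
# are already well balanced (max_scale_ratio below the 1.2 threshold).
import numpy as np

INNERQ_SCALE_RATIO_THRESHOLD = 1.2  # threshold quoted in the hypothesis

def should_use_innerq(head_vectors: np.ndarray, group_size: int = 32):
    """head_vectors: (n_tokens, head_dim) activations for one head (layout assumed)."""
    x = np.asarray(head_vectors)
    n_groups = x.shape[1] // group_size
    groups = x[:, :n_groups * group_size].reshape(x.shape[0], n_groups, group_size)
    scales = np.abs(groups).max(axis=(0, 2))              # one scale per channel group
    max_scale_ratio = scales.max() / (scales.min() + 1e-12)
    return max_scale_ratio >= INNERQ_SCALE_RATIO_THRESHOLD, max_scale_ratio

use_it, ratio = should_use_innerq(np.random.randn(1024, 256))
print(f"max_scale_ratio={ratio:.3f} -> InnerQ {'enabled' if use_it else 'disabled'}")
```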

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Need K-only quantization or a specialized correction
buun via cuda-rtx3090
Perplexity logits buffer requires >37GB host RAM at 65K context. Need to confirm adaptive chunking + Q-batching hold PPL-match at this scale.
EXP-0002
context: 65536 chunks: 8 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048 approach: [adaptive
dusterbloom via adaptive-chunked-prefill
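A rough sketch of what "choosing the largest viable chunk size" could look like; the per-token cost constant and the candidate list are placeholders, not the fork's actual accounting.

```python
# Illustrative adaptive chunk sizing: pick the largest candidate chunk whose
# temporary prefill buffers fit the currently free VRAM. The per-token cost
# constant and candidate list are placeholders, not the fork's accounting.
BYTES_PER_PREFILL_TOKEN = 4 * 1024 * 1024
CHUNK_CANDIDATES = [8192, 4096, 2048, 1024, 512, 256]

def pick_chunk_size(free_vram_bytes: int, tokens_remaining: int, headroom: float = 0.8) -> int:
    """Largest candidate that fits the VRAM budget without overshooting the prompt."""
    budget = int(free_vram_bytes * headroom)
    for chunk in CHUNK_CANDIDATES:
        if chunk <= tokens_remaining and chunk * BYTES_PER_PREFILL_TOKEN <= budget:
            return chunk
    return CHUNK_CANDIDATES[-1]  # fall back to the smallest chunk

def chunk_schedule(n_prompt_tokens: int, free_vram_bytes: int) -> list[int]:
    """Chunk sizes a prefill loop would use for a prompt of n_prompt_tokens."""
    schedule, pos = [], 0
    while pos < n_prompt_tokens:
        chunk = pick_chunk_size(free_vram_bytes, n_prompt_tokens - pos)
        schedule.append(chunk)
        pos += chunk
    return schedule

# Under this placeholder cost model, a 65536-token prompt with plenty of free
# VRAM gets 8192-token chunks, i.e. larger than the fixed 4096 baseline.
print(chunk_schedule(65536, 64 * 1024**3))
```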
EXP-0009 showed the fused kernel's serial KV loop causes 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequanted tiles) should close the gap to 2-5x by exploiting tensor core parallelism while still avoiding full materialization of dequanted KV.
EXP-0009
approach: tiled_fused_attention tile_kv: [16 tile_q: [16 use_mma: true bits: 3
dusterbloom via adaptive-chunked-prefill
Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per-token.
EXP-0009
approach: native_tbq3_matmul tile_m: 16 tile_n: 16 tile_k: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% extra memory savings at acceptable quality cost for tokens with negligible attention weight
decay_threshold_positions: 16384 source: turbo3_tcq target: turbo2_tcq
buun via cuda-rtx3090
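Policy-level sketch of the decay idea, assuming a block-structured cache; only the 16384-position threshold and the source/target format names come from the proposal, the block layout and requantization call are hypothetical stand-ins.

```python
# Policy-level sketch of decaying old KV blocks from turbo3_tcq to turbo2_tcq.
# The KVBlock layout is a hypothetical stand-in; only the 16384-position
# threshold and the format names come from the proposal.
from dataclasses import dataclass

DECAY_THRESHOLD_POSITIONS = 16384
SOURCE_FMT, TARGET_FMT = "turbo3_tcq", "turbo2_tcq"

@dataclass
class KVBlock:
    first_pos: int      # position of the oldest token in the block
    last_pos: int       # position of the newest token in the block
    fmt: str            # current cache quantization format

def decay_old_blocks(blocks: list[KVBlock], current_pos: int) -> int:
    """Requantize blocks whose newest token is older than the decay threshold."""
    n_requantized = 0
    for blk in blocks:
        age = current_pos - blk.last_pos
        if blk.fmt == SOURCE_FMT and age > DECAY_THRESHOLD_POSITIONS:
            blk.fmt = TARGET_FMT   # real code would rewrite the packed payload here
            n_requantized += 1
    return n_requantized

cache = [KVBlock(i, i + 255, SOURCE_FMT) for i in range(0, 32768, 256)]
print(decay_old_blocks(cache, current_pos=32768), "blocks decayed to", TARGET_FMT)
```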
Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control — enables fine-grained quality/compression tradeoffs beyond fixed layer-adaptive modes
syntax: type:layer_range separator: ,
buun via cuda-rtx3090
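One way the proposed spec string could be parsed; the type:layer_range, comma-separated format is from the proposal, while the parser itself is only an illustration, not the fork's actual argument handling.

```python
# Illustrative parser for the proposed per-layer cache-type syntax, e.g.
# --cache-type-k "turbo3_tcq:0-31,q8_0:32-39". Only the spec format comes
# from the proposal; this function is not the fork's argument parser.
def parse_cache_type_spec(spec: str, n_layers: int, default: str = "f16") -> list[str]:
    per_layer = [default] * n_layers
    for entry in spec.split(","):
        ktype, layer_range = entry.split(":")
        start, _, end = layer_range.partition("-")
        lo, hi = int(start), int(end or start)   # "q8_0:7" selects a single layer
        if not (0 <= lo <= hi < n_layers):
            raise ValueError(f"layer range {layer_range!r} out of bounds for {n_layers} layers")
        for layer in range(lo, hi + 1):
            per_layer[layer] = ktype
    return per_layer

types = parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", n_layers=40)
print(types[0], types[31], types[32], types[39])   # turbo3_tcq turbo3_tcq q8_0 q8_0
```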
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800 cache_type: tbq2 approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill
There is a sweet spot between chunk sizes — smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768 chunk_sizes: [256 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon where there are no tensor cores and butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention backend: metal head_dim: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared memory usage and register pressure.
EXP-0006
head_dim: 256 model: gemma-3-12b contexts: [2048 approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill