TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
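
Below is a minimal numpy sketch of the turbo3 encode/decode path named above (FWHT rotation, norm correction, Lloyd-Max codebook lookup). It is an illustration under assumptions: the 8-entry codebook values, the norm-correction placement, and the helper names are made up here and are not the shipped CUDA kernels.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis (power-of-two length)."""
    y = x.astype(np.float64)
    d = y.shape[-1]
    h = 1
    while h < d:
        blocks = y.reshape(*y.shape[:-1], d // (2 * h), 2, h)
        a, b = blocks[..., 0, :].copy(), blocks[..., 1, :].copy()
        blocks[..., 0, :], blocks[..., 1, :] = a + b, a - b
        h *= 2
    return y / np.sqrt(d)

# Hypothetical 8-entry (3-bit) codebook; the real Lloyd-Max centroids are fit to the
# post-rotation value distribution.
CODEBOOK = np.array([-0.35, -0.22, -0.13, -0.04, 0.04, 0.13, 0.22, 0.35])

def turbo3_encode(v):
    """Rotate, normalize, and map each channel to its nearest codebook centroid."""
    r = fwht(v)
    norm = np.linalg.norm(r) + 1e-12                 # stored alongside the 3-bit codes
    idx = np.abs(r[:, None] / norm - CODEBOOK[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), norm

def turbo3_decode(idx, norm):
    """Look up centroids, rescale to the stored norm (norm correction), and un-rotate."""
    u = CODEBOOK[idx]
    u = u * (norm / (np.linalg.norm(u) + 1e-12))
    return fwht(u)                                    # orthonormal FWHT is self-inverse

v = np.random.randn(128) / np.sqrt(128)               # one head_dim=128 K or V vector
codes, norm = turbo3_encode(v)
v_hat = turbo3_decode(codes, norm)
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```

The rotation spreads outlier channels across the head dimension so a single small codebook fits the post-rotation distribution, and the stored norm restores the vector's overall scale after dequantization.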

Created by @buun, 2026-03-27T17:28:26Z

Showing 87 experiments

ID Title / Hypothesis Result Confidence Reproductions Metrics
cexp_bf9572
Gemma 4's K=V shared projections interact differently with KV cache quantization due to correlated K/V errors
inconclusive
0.14
1/5
kld_gemma4_q8, kld_qwen35_q8
cexp_c05db2
Processing 2 elements at once with half2 and ternary sign reduces address count
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_c08e7e
Threshold choice across 5 orders of magnitude does not affect quality
neutral
0.14
1/5
ppl_1e4, ppl_1e5, ppl_1e6, ppl_1e7, ppl_1e8
cexp_c4ea9d
Flipping computation to iterate over centroids and accumulate matching Q elements avoids per-element lookup
negative
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_c9ca34
Storing first N tokens at fp16 improves PPL (sink tokens get disproportionate attention)
neutral
0.14
1/5
ppl_no_sink, ppl_4_sinks, ppl_8_sinks, ppl_16_sinks
cexp_cfe4ca
KV cache operations are a negligible fraction of total decode compute at batch_size=1
success
0.14
1/5
cexp_d4375d
Establish baseline PPL values for TBQ types on Qwen3.5-9B
baseline
0.14
1/5
ppl_f16, ppl_q8_0, ppl_tbq4, ppl_tbq3, ppl_tbq2
cexp_d6b923
turbo3 maintains retrieval accuracy with multiple distractor needles through 32K context
success
0.14
1/5
retrieval_pct
cexp_dd82bb
Fully branchless FMA chain with zero memory access beats constant LUT
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_e22a96
Gemma-3 V cache broken because V un-rotation missing from iSWA build_attn overload
success
0.14
1/5
ppl_q8_baseline, ppl_turbo3_kv_after_fix, ppl_turbo3_k_only, ppl_turbo3_kv_before_fix
cexp_e426cb
Fusing activation quantization into the weight GEMM kernel eliminates separate q8_1 quantize kernel launch + L2 traffic
negative
0.14
1/5
decode_baseline, decode_f32_fma, decode_inreg_dp4a, decode_smem_dp4a
cexp_e5416b
turbo3 with FWHT rotation + norm correction matches q8_0 quality
success
0.14
1/5
ppl, ppl_q8_baseline, compression_ratio
cexp_e66662
QJL correction in turbo4 is unnecessary overhead
failure
0.14
1/5
ppl_with_qjl, ppl_without_qjl, prefill_without_qjl, prefill_with_qjl
cexp_e68bfa
Alpha applied at decode time (scaling dequantized V) vs encode time (baked into fp16 norm before quantization) may give different results, with decode-time enabling context-adaptive deployment
success
0.14
1/5
cexp_ec2202
Better byte extraction with 8-entry LUT improves over baseline
failure
0.14
1/5
decode_tok_s_8kvs_ceiling_pct
cexp_ee1dd9
Per-channel scaling before FWHT reduces head_dim=128 quality gap by aligning channel variances
negative
0.14
1/5
theoretical_gain
cexp_f0f034
Subtract per-element mean before FWHT to improve quantization
negative
0.14
1/5
ppl_turbo3_baseline, ppl_turbo3_dc, ppl_turbo4_baseline, ppl_turbo4_dc
cexp_f2b4aa
turbo3 KLD tracks bit rate, not implementation quality
neutral
0.14
1/5
kld_moe_turbo3, kld_moe_q4_0, kld_moe_q8_0, kld_dense_turbo3, kld_dense_q4_0, kld_dense_q8_0, same_top_p_moe_turbo3, same_top_p_dense_turbo3
cexp_f85c20
Capturing decode step as CUDA Graph eliminates per-kernel launch overhead (hundreds of launches per token)
success
0.14
1/5
decode_baseline, decode_cuda_graphs
cexp_3e9a2e
Asymmetric q8_0/tbq3 will maintain <5% prefill gap and decode parity across diverse model architectures
success
0.08
1/5
cexp_466df6
The fused kernel's end-to-end slowdown is amplified/masked by non-attention compute (FFN, norms, embeddings). By comparing f16 KV (zero dequant overhead) vs TBQ3 baseline vs TBQ3 fused, we can isolate the attention-only time and understand what fraction of total time is available for optimization.
success
0.08
1/5
cexp_59b1a9
Since the SRHT (H, diag(r)) is fixed across all tokens, we can pre-rotate Q with forward SRHT once, compute K scores directly against centroids (no K butterfly), accumulate V in compressed domain (no V butterfly), and apply ONE inverse SRHT at the end. This eliminates 14 butterfly stages per KV token from the inner loop, reducing per-token compute by ~8x.
success
0.08
1/5
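
A numpy sketch of the equivalence the cexp_59b1a9 hypothesis relies on: because the SRHT is orthonormal and fixed across tokens, Q can be forward-rotated once, scores taken directly against the stored (rotated) K, and the V sum accumulated in the rotated domain with one inverse rotation at the end. A dense orthonormal matrix R stands in for the H·diag(r) butterflies, and quantization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 128, 512                          # head_dim, cached KV tokens

# Dense orthonormal stand-in for the fixed SRHT rotation H·diag(r).
R, _ = np.linalg.qr(rng.standard_normal((D, D)))

q = rng.standard_normal(D)
K = rng.standard_normal((T, D))          # "true" keys/values before storage
V = rng.standard_normal((T, D))
K_rot, V_rot = K @ R.T, V @ R.T          # what the cache holds (rotation only, no quantization)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Reference: un-rotate every cached token (two butterflies per KV token).
a_ref = softmax((K_rot @ R) @ q / np.sqrt(D))
o_ref = a_ref @ (V_rot @ R)

# Compressed domain: rotate q once, score against stored K directly,
# accumulate V in the rotated domain, apply ONE inverse rotation at the end.
q_rot = R @ q
a = softmax(K_rot @ q_rot / np.sqrt(D))
o = (a @ V_rot) @ R

assert np.allclose(o, o_ref, atol=1e-6)
```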
cexp_6d957c
Using absolute query positions (nkv-nq+q_start+q_len) for causal skip correctly prunes Q batches without affecting PPL, saving 38-47% compute during full prefill
success
0.08
1/5
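
A small sketch of how the absolute-position arithmetic in cexp_6d957c can drive the causal skip; the helper name and the tile framing are illustrative, the offset expression is the one quoted above.

```python
def kv_tile_fully_masked(nkv, nq, q_start, q_len, kv_start):
    """True when a KV tile starting at kv_start lies in the causal future of every
    query in the sub-batch [q_start, q_start + q_len).

    With nkv cached positions and nq new queries, query row q sits at absolute
    position nkv - nq + q, so the newest query of the sub-batch is at
    nkv - nq + q_start + q_len - 1; any KV position past it is masked for the
    whole sub-batch, and the tile can be skipped without changing the result.
    """
    last_q_pos = nkv - nq + q_start + q_len - 1
    return kv_start > last_q_pos

# During a full prefill (nkv == nq) just under half of all (Q sub-batch, KV tile)
# pairs fall past the causal boundary, consistent with the reported 38-47% saving.
assert kv_tile_fully_masked(nkv=4096, nq=4096, q_start=0, q_len=256, kv_start=512)
assert not kv_tile_fully_masked(nkv=4096, nq=4096, q_start=512, q_len=256, kv_start=512)
```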
cexp_70847d
Adding head_dim=256 support to the fused TBQ3 attention kernel enables it to work with Qwen3.5-27B and Gemma-3 models. The kernel needs per-block SRHT (two independent 128-element butterflies for D=256) and FLOATS_PER_LANE=D/WARP_SIZE generalization.
success
0.08
1/5
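
A small numpy illustration of the per-block SRHT structure cexp_70847d describes for head_dim=256: two independent 128-element butterflies, one per half of the head dimension. The diag(r) sign flips are omitted for brevity, so this only demonstrates the block layout, not the full rotation or the kernel.

```python
import numpy as np

def fwht128(x):
    """128-point orthonormal fast Walsh-Hadamard transform of a 1-D block."""
    y, h = x.astype(np.float64), 1
    while h < 128:
        y = y.reshape(-1, 2, h)
        y = np.concatenate([y[:, 0] + y[:, 1], y[:, 0] - y[:, 1]], axis=1).reshape(-1)
        h *= 2
    return y / np.sqrt(128)

def rotate_d256(v):
    """Per-block rotation for head_dim=256: two independent 128-element butterflies."""
    return np.concatenate([fwht128(v[:128]), fwht128(v[128:])])

v = np.random.randn(256)
r = rotate_d256(v)
assert np.isclose(np.linalg.norm(r), np.linalg.norm(v))   # each block is orthonormal
assert np.allclose(rotate_d256(r), v)                     # and self-inverse
```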
cexp_c69dc2
Fusing TBQ3 dequant (inverse SRHT) directly into a FlashAttention-style online softmax kernel eliminates all intermediate buffers (k_tmp, v_tmp, S) while producing identical results
success
0.08
1/5
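
A numpy sketch of the online-softmax recurrence that a FlashAttention-style fused kernel like cexp_c69dc2 builds on: each KV tile is dequantized only as its tile is processed and folded into a running (max, sum, output) state, so no full score matrix or dequantized K/V buffer is ever materialized. The dequant callback is an identity stand-in for TBQ3 dequant + inverse SRHT.

```python
import numpy as np

def online_softmax_attention(q, kv_tiles, dequant, scale):
    """Single-pass attention: fold each dequantized KV tile into a running
    (max, denominator, output) state instead of materializing the score row."""
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    o = np.zeros_like(q)             # running (unnormalized) output
    for k_packed, v_packed in kv_tiles:
        K, V = dequant(k_packed), dequant(v_packed)   # tile-sized dequant only
        s = K @ q * scale
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)                   # correct previous state for the new max
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        o = o * rescale + p @ V
        m = m_new
    return o / l

rng = np.random.default_rng(1)
D, T, B = 64, 256, 32
q = rng.standard_normal(D)
K = rng.standard_normal((T, D))
V = rng.standard_normal((T, D))
tiles = [(K[i:i + B], V[i:i + B]) for i in range(0, T, B)]   # stand-ins for packed TBQ3 tiles

o = online_softmax_attention(q, tiles, dequant=lambda t: t, scale=1.0 / np.sqrt(D))

s_ref = K @ q / np.sqrt(D)
a_ref = np.exp(s_ref - s_ref.max())
a_ref /= a_ref.sum()
assert np.allclose(o, a_ref @ V, atol=1e-6)
```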
cexp_e7c22d
Bulk-dequanting V to fp16 before MMA launch (instead of fusing V dequant into the tile loader) will close the pp8192 gap because V can use the standard cp.async.cg pipeline
success
0.08
1/5
pp8192_before, pp8192_after
cexp_e7e920
The fused kernel's Q addressing was swapped — it used q_head * (nq * D) instead of q_idx * (nh_q * D). After ggml_permute(0,2,1,3), Q physical layout is [nq, nh_q, D] (token-major) not [nh_q, nq, D] (head-major). Fixing this should make PPL match baseline.
success
0.08
1/5
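
A numpy sketch of the addressing fix in cexp_e7e920; the shapes are illustrative, but the two offset expressions are the ones quoted in the hypothesis, evaluated against the token-major [nq, nh_q, D] layout produced by ggml_permute(0,2,1,3).

```python
import numpy as np

nq, nh_q, D = 4, 8, 128
# After ggml_permute(0,2,1,3) the Q tensor is token-major: [nq, nh_q, D].
Q = np.arange(nq * nh_q * D, dtype=np.float32).reshape(nq, nh_q, D)
flat = Q.reshape(-1)
q_idx, q_head = 2, 5

buggy_off = q_head * (nq * D) + q_idx * D     # head-major assumption: wrong for this layout
fixed_off = q_idx * (nh_q * D) + q_head * D   # token-major addressing: matches the layout

assert np.array_equal(flat[fixed_off:fixed_off + D], Q[q_idx, q_head])
assert not np.array_equal(flat[buggy_off:buggy_off + D], Q[q_idx, q_head])
```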

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Need K-only quantization or a specialized correction
buun via cuda-rtx3090
The perplexity logits buffer requires >37GB of host RAM at 65K context. Need to confirm that adaptive chunking + Q-batching still match baseline PPL at this scale.
EXP-0002
context: 65536 chunks: 8 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048 approach: [adaptive
dusterbloom via adaptive-chunked-prefill
EXP-0009 showed the fused kernel's serial KV loop causes 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequanted tiles) should close the gap to 2-5x by exploiting tensor core parallelism while still avoiding full materialization of dequanted KV.
EXP-0009
approach: tiled_fused_attention tile_kv: [16 tile_q: [16 use_mma: true bits: 3
dusterbloom via adaptive-chunked-prefill
Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per-token.
EXP-0009
approach: native_tbq3_matmul tile_m: 16 tile_n: 16 tile_k: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% extra memory savings at acceptable quality cost for tokens with negligible attention weight
decay_threshold_positions: 16384 source: turbo3_tcq target: turbo2_tcq
buun via cuda-rtx3090
Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control — enables fine-grained quality/compression tradeoffs beyond fixed layer-adaptive modes
syntax: type:layer_range separator: ,
buun via cuda-rtx3090
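
A sketch of a parser for the proposed per-layer --cache-type-k syntax above, assuming comma-separated type:layer_range entries with inclusive ranges as in the example; the error handling (none), the q8_0 default for unspecified layers, and the single-layer shorthand are illustrative choices, not an existing flag.

```python
def parse_cache_type_spec(spec, n_layers, default="q8_0"):
    """Expand e.g. "turbo3_tcq:0-31,q8_0:32-39" into a per-layer cache-type list."""
    types = [default] * n_layers
    for entry in spec.split(","):
        type_name, _, layer_range = entry.partition(":")
        lo, _, hi = layer_range.partition("-")
        for layer in range(int(lo), int(hi or lo) + 1):
            types[layer] = type_name
    return types

assert parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", 40) == \
    ["turbo3_tcq"] * 32 + ["q8_0"] * 8
```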
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800 cache_type: tbq2 approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill
There is a sweet spot between chunk sizes — smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768 chunk_sizes: [256 cache_type: tbq3
dusterbloom via adaptive-chunked-prefill
EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon where there are no tensor cores and butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention backend: metal head_dim: 128 bits: 3
dusterbloom via adaptive-chunked-prefill
EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared memory usage and register pressure.
EXP-0006
head_dim: 256 model: gemma-3-12b contexts: [2048 approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill