TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
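
As a reference for the pieces named above, here is a minimal NumPy sketch of the quantize/dequantize path: FWHT rotation, norm recording, Lloyd-Max nearest-centroid lookup, and norm correction on the way back. The 8 levels are the classic Lloyd-Max values for a unit Gaussian; the actual trained codebooks, bit packing, and kernels differ, and all function names here are illustrative.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length a power of 2).
    Self-inverse under this scaling, so it serves as rotate and un-rotate."""
    x = np.asarray(x, dtype=np.float64).copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

# Classic Lloyd-Max levels for a unit Gaussian at 3 bits (trained codebooks differ).
CODEBOOK3 = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def turbo3_quantize(v):
    """Rotate, record the norm for later correction, snap to nearest centroid."""
    r = fwht(v)
    norm = np.linalg.norm(r)
    u = r / norm * np.sqrt(len(r))              # roughly unit-variance coordinates
    idx = np.abs(u[:, None] - CODEBOOK3).argmin(axis=1).astype(np.uint8)
    return idx, norm                            # 3-bit indices + one scale per vector

def turbo3_dequantize(idx, norm):
    """LUT lookup, undo the normalization, invert the rotation."""
    u = CODEBOOK3[idx]
    return fwht(u * norm / np.sqrt(len(u)))
```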

Created by @buun, 2026-03-27T17:28:26Z

Fields per entry: ID, title/hypothesis, result, confidence, reproductions, metrics.

cexp_55ab69: Promoting quality-sensitive layers to q8_0 improves PPL while maintaining compression
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl, prefill_pp4096, decode_tg64, compression_ratio

cexp_59b5fc: Training codebooks to lower MSE always improves perplexity
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_50iter, ppl_200iter

cexp_5c19ee: Per-channel RMS-based scaling before L2 norm + FWHT reduces turbo3 quantization error on head_dim=128, where channels are anisotropic
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl, ppl_q8_baseline, ppl_turbo3_no_innerq

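cexp_5c19ee layers a per-channel pre-scale on top of that pipeline (reusing turbo3_quantize / turbo3_dequantize from the sketch above); channel_rms is an assumed per-channel RMS statistic from a calibration pass, not a shipped artifact:

```python
# channel_rms: shape (head_dim,), measured on calibration data (assumption).
def quantize_prescaled(v, channel_rms):
    # Equalize anisotropic channels so none dominates the post-FWHT distribution.
    return turbo3_quantize(v / (channel_rms + 1e-6))

def dequantize_prescaled(idx, norm, channel_rms):
    # Undo the pre-scale after the usual LUT + norm + inverse-FWHT path.
    return turbo3_dequantize(idx, norm) * (channel_rms + 1e-6)
```
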
cexp_601201: MLX TurboQuant preserves needle-in-haystack retrieval
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: niah_2_5bit_score, niah_2_5bit_total, niah_3_5bit_score, niah_3_5bit_total

cexp_6237df: Alpha should vary logarithmically with context length to track the changing optimal operating point
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: alpha_3bit_2k, alpha_3bit_32k, alpha_2bit_2k, alpha_2bit_32k

cexp_697142: TCQ (512-state bitshift trellis with Viterbi encode, O(1) sliding-window decode) improves KV cache quality over scalar Lloyd-Max
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_turbo3_tcq, ppl_turbo2_tcq, ppl_turbo2_scalar, ppl_q8_baseline

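A toy version of cexp_697142's trellis, in the same NumPy register: a bitshift trellis with 2**L states (L=9 gives the 512 states above), k=3 code bits shifted in per element, Viterbi picking the min-MSE path, and a decoder whose state is just the last L code bits, which is what makes the sliding-window decode O(1) per element. The per-state lut is assumed to come from offline training, and this reference encoder is O(n * states * 2**k), nothing like the real kernel:

```python
import numpy as np

def tcq_encode(x, lut, L=9, k=3):
    """Viterbi encode x against a bitshift trellis; returns k-bit symbols."""
    S, mask = 2 ** L, 2 ** L - 1
    cost = np.full(S, np.inf)
    cost[0] = 0.0                            # pin the start state so decode begins at 0
    back = np.zeros((len(x), S), dtype=np.int32)
    for t, xt in enumerate(x):
        new_cost = np.full(S, np.inf)
        for prev in range(S):
            if cost[prev] == np.inf:
                continue
            base = (prev << k) & mask
            for b in range(2 ** k):          # shift in each candidate symbol
                s = base | b
                c = cost[prev] + (xt - lut[s]) ** 2
                if c < new_cost[s]:
                    new_cost[s], back[t, s] = c, prev
        cost = new_cost
    s = int(np.argmin(cost))
    symbols = []
    for t in range(len(x) - 1, -1, -1):      # backtrace the surviving path
        symbols.append(s & (2 ** k - 1))
        s = back[t, s]
    return symbols[::-1]

def tcq_decode(symbols, lut, L=9, k=3):
    """Sliding-window decode: the state is literally the last L code bits."""
    s, out = 0, []
    for b in symbols:
        s = ((s << k) | b) & (2 ** L - 1)
        out.append(lut[s])
    return np.array(out)
```
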
cexp_69a1fe: Reducing constant memory addresses from 8 to 4 via magnitude-only LUT with XOR sign recovery improves decode on pre-M5 Apple Silicon
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, speedup_vs_baseline

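cexp_69a1fe's trick relies on codebook symmetry: store only the four magnitudes and rebuild the sign by XORing into the float's IEEE sign bit. A sketch with an assumed {sign_bit, mag_idx} index layout (cexp_93a2c9 further down tests exactly such a remap):

```python
import numpy as np

# Four magnitudes instead of eight signed values: halves constant-memory addresses.
MAG_LUT = np.array([0.245, 0.756, 1.344, 2.152], dtype=np.float32)

def dequant3_mag_sign(idx3):
    """idx3 in 0..7, assumed layout {sign_bit, mag_idx}: value = +/- MAG_LUT[mag]."""
    idx3 = np.asarray(idx3, dtype=np.uint32)
    sign = (idx3 >> 2) & 1                      # top bit of the 3-bit index
    mag = idx3 & 3                              # low two bits address the LUT
    bits = MAG_LUT[mag].view(np.uint32) ^ (sign << 31)   # XOR sign recovery
    return bits.view(np.float32)
```
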
cexp_6e68ef: Inlining dequant into the flash attention loop eliminates function call overhead
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_6fa654: α = ||x||·dot(x,q)/||q||² halves per-element MSE vs the L2-preserving β = ||x||/||q||
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_l2_preserving, ppl_mse_optimal

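The least-squares identity behind cexp_6fa654: for reconstruction αq, the MSE-optimal scale is α = argmin_a ||x − aq||² = dot(x,q)/||q||², which matches the form above when q encodes the unit direction x/||x||; β instead preserves the L2 norm. A quick check with a crude stand-in quantizer (the rounding here is not the real codebook):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
q = np.round(x * 2) / 2                          # crude stand-in for the quantizer

beta = np.linalg.norm(x) / np.linalg.norm(q)     # L2-preserving scale
alpha = np.dot(x, q) / np.dot(q, q)              # least-squares-optimal scale

print(np.mean((x - beta * q) ** 2),    # always >= the alpha error...
      np.mean((x - alpha * q) ** 2))   # ...but the L2 norm is no longer preserved
```
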
cexp_765085: Different quantization types for K vs V can improve the quality/speed tradeoff
Result: inconclusive | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_turbo4k_q8v, ppl_q8k_turbo3v, ppl_turbo4k_turbo3v, ppl_turbo3k_turbo4v, decode_q8k_turbo3v, decode_turbo4k_q8v

cexp_799341: Sparse V benefits are not turbo3-specific; they also apply to q8_0
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_speedup, ppl_change, niah_change

cexp_80c0cd: Deferring the norm multiply to after the LUT lookup improves ILP
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_8410e2: Establish a dequant baseline with the standard 8-entry constant LUT
Result: baseline | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, ceiling_tok_s

cexp_8593ff: Reorder FWHT output by sequency to group similar-frequency components
Result: neutral | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_baseline, ppl_walsh

cexp_8aeaf9: Skipping V dequant for attention weights below 1e-6 has zero quality impact and improves decode speed
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_speedup_32k, ppl_sparse_v_on, ppl_sparse_v_off, niah_sparse_v, niah_sparse_v_total, niah_q8_0, niah_q8_0_total, skip_rate_512, skip_rate_4k, skip_rate_32k

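The rule in cexp_8aeaf9 (and cexp_bbb376 below), in isolation; v_dequant stands in for the expensive fused dequantization of one V row:

```python
import numpy as np

def attend_sparse_v(scores, v_dequant, head_dim, eps=1e-6):
    """One query's attention output, skipping V dequant where the softmax
    weight cannot matter: each skipped term moves the numerator by less
    than eps, which is why quality is unchanged."""
    w = np.exp(scores - scores.max())        # safe softmax numerator
    acc = np.zeros(head_dim)
    for i in np.nonzero(w >= eps)[0]:        # skip_rate = fraction filtered out
        acc += w[i] * v_dequant(i)           # dequantize only the rows that matter
    return acc / w.sum()                     # denominator stays exact
```
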
cexp_9180a2: Cross-lane register transfer via simd_shuffle avoids constant memory entirely
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_926406: Named register variables with ternary select avoid both the LUT and array indexing
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_92ae80: Clifford algebra rotor rotation is faster than dense matrix rotation
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: speedup_cuda_1k, speedup_cuda_16k, speedup_metal_1k, speedup_metal_65k, parameter_reduction

cexp_93a2c9: Remapping the turbo3 3-bit index to {mag_idx, sign_bit} halves register LUT pressure, improving decode
Result: neutral | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tg64_4k, decode_tg64_32k, decode_tg64_4k_baseline, decode_tg64_32k_baseline, ppl

cexp_954f7f: Replacing the fixed TBQ_CHUNK=4096 with a cudaMemGetInfo-based calculation enables better VRAM utilization
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_2k_mmap, ppl_2k_chunked, ppl_8k_mmap, ppl_8k_chunked, ppl_32k_mmap, ppl_32k_chunked

cexp_96a193: turbo3 achieves near-q8_0 quality on Apple Silicon with 4.6x compression
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: compression_ratio, prefill_tok_s, prefill_ratio_vs_q8, decode_tok_s_2k, decode_ratio_vs_q8_2k, decode_tok_s_8k, decode_ratio_vs_q8_8k, decode_tok_s_32k, decode_ratio_vs_q8_32k, perplexity, raw_kv_kurtosis, post_rotation_kurtosis, post_rotation_std, expected_std, std_ratio

cexp_991050: turbo3 enables full 128K context on a 24GB GPU where q8_0 OOMs
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: prefill_pp131072, decode_tg64_128k, vram_gb

cexp_9ceeb7: Processing queries in batches reduces the S buffer from O(nh_q*nq*chunk) to O(nh_q*q_batch*chunk)
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: s_buffer_27b_70k_before_gb, s_buffer_27b_70k_after_mb, ppl_32k

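Worked out for cexp_9ceeb7, with the head count, query count, batch size, and fp32 score dtype all assumed for illustration (the measured 27B/70K numbers live in the metrics above):

```python
nh_q, nq, chunk, bytes_f32 = 32, 70_000, 4096, 4   # all assumed, for scale only
before = nh_q * nq * chunk * bytes_f32             # whole query axis in flight
after = nh_q * 64 * chunk * bytes_f32              # q_batch = 64 queries at a time
print(f"{before / 2**30:.1f} GiB -> {after / 2**20:.1f} MiB")  # ~34.2 GiB -> 32.0 MiB
```
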
cexp_a2f5f5: KVLinC claims rotation hurts keys; test leaving K unrotated
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: ppl_both_rotated, ppl_k_unrotated_v_rotated, ppl_neither_rotated

cexp_ac6e32: Moving the backtrace from shared to global memory and double-buffering the cost arrays reduces syncthreads in the Viterbi encode
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_dense_baseline, decode_dense_opt, decode_moe_baseline, decode_moe_opt, ppl_baseline, ppl_optimized

cexp_afba59: Skipping entire KV tiles when all QK scores are far below the running max reduces V dequant work at long context
Result: negative | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_baseline_2k, decode_skip_2k, decode_baseline_65k, decode_skip_65k

cexp_b43ed6: Weighting codebook training by the query-norm distribution (Q²) improves downstream KLD by optimizing for attention-weighted distortion
Result: success | Confidence: 0.14 | Reproductions: 1/5

cexp_b7f864: Pure arithmetic (mul+add) to reconstruct centroids without any memory access
Result: failure | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tok_s_8k, vs_ceiling_pct

cexp_bbb376: Skip V dequant for negligible attention weights (exp(score-max) < 1e-6)
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: decode_tg64_8k_before, decode_tg64_8k_after, decode_tg64_32k, ppl

cexp_bd8d18: turbo3 quality generalizes to dense architectures on Apple Silicon
Result: success | Confidence: 0.14 | Reproductions: 1/5
Metrics: compression_ratio, perplexity

Proposed Experiments

Gemma 4's K=V shared projections cause catastrophic V quantization (+70% PPL). Needs K-only quantization or a specialized correction.
buun via cuda-rtx3090

Perplexity logits buffer requires >37GB host RAM at 65K context. Need to confirm that adaptive chunking + Q-batching still match baseline PPL at this scale.
EXP-0002
context: 65536, chunks: 8, cache_type: tbq3
dusterbloom via adaptive-chunked-prefill

Adaptive chunking should show higher prefill throughput than fixed chunk=4096 at long contexts by choosing the largest viable chunk size.
EXP-0002
contexts: [2048, …], approach: [adaptive, …]
dusterbloom via adaptive-chunked-prefill

EXP-0009 showed the fused kernel's serial KV loop causes a 35-43x slowdown vs tensor cores. A proper tiled approach (Bc>1 KV tokens per tile, warp-level MMA on dequantized tiles) should close the gap to 2-5x by exploiting tensor-core parallelism while still avoiding full materialization of the dequantized KV.
EXP-0009
approach: tiled_fused_attention, tile_kv: [16, …], tile_q: [16, …], use_mma: true, bits: 3
dusterbloom via adaptive-chunked-prefill

Instead of dequant-then-MMA, a custom WMMA kernel that reads TBQ3 packed data and applies the inverse SRHT inside the tile accumulator could achieve near-native f16 throughput. The key is amortizing the 7-stage butterfly over a full Bc tile rather than per token.
EXP-0009
approach: native_tbq3_matmul, tile_m: 16, tile_n: 16, tile_k: 128, bits: 3
dusterbloom via adaptive-chunked-prefill

Requantize old tokens from turbo3_tcq to turbo2_tcq: ~30% additional memory savings at acceptable quality cost for tokens with negligible attention weight.
decay_threshold_positions: 16384, source: turbo3_tcq, target: turbo2_tcq
buun via cuda-rtx3090

Allow --cache-type-k "turbo3_tcq:0-31,q8_0:32-39" syntax for manual per-layer control, enabling fine-grained quality/compression tradeoffs beyond the fixed layer-adaptive modes (a parser sketch follows below).
syntax: type:layer_range, separator: ","
buun via cuda-rtx3090

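A minimal parser for the proposed syntax; parse_cache_type_spec is a hypothetical helper, and erroring on uncovered layers is one possible policy the proposal leaves open:

```python
def parse_cache_type_spec(spec, n_layers):
    """Parse e.g. "turbo3_tcq:0-31,q8_0:32-39" into a per-layer type list."""
    types = [None] * n_layers
    for part in spec.split(","):
        qtype, _, rng = part.partition(":")       # "turbo3_tcq", "0-31"
        lo, _, hi = rng.partition("-")            # single index "5" also allowed
        for layer in range(int(lo), int(hi or lo) + 1):
            types[layer] = qtype
    if None in types:
        raise ValueError("layer ranges must cover all layers")
    return types

# parse_cache_type_spec("turbo3_tcq:0-31,q8_0:32-39", 40)
# -> ["turbo3_tcq"] * 32 + ["q8_0"] * 8
```
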
TBQ2's aggressive 2-bit quantization allows very large KV caches. At 200K+ context, adaptive chunking should keep peak VRAM bounded while maintaining acceptable PPL.
EXP-0001
context: 204800, cache_type: tbq2, approach: adaptive_chunk_sizing
dusterbloom via adaptive-chunked-prefill

There is a sweet spot between chunk sizes: smaller chunks waste kernel launches, larger chunks thrash VRAM. Profiling 256..8192 at a fixed context reveals the tradeoff.
EXP-0002
context: 32768, chunk_sizes: [256, …], cache_type: tbq3
dusterbloom via adaptive-chunked-prefill

EXP-0008's compressed-domain trick (eliminating 14 butterfly stages per KV token) has maximum impact on Apple Silicon, where there are no tensor cores and the butterfly is expensive relative to total compute. Port the compressed-domain kernel to Metal and benchmark on M-series.
EXP-0008
approach: compressed_domain_attention, backend: metal, head_dim: 128, bits: 3
dusterbloom via adaptive-chunked-prefill

EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach for D=256 may have different performance characteristics than D=128 due to doubled shared-memory usage and register pressure.
EXP-0006
head_dim: 256, model: gemma-3-12b, contexts: [2048, …], approach: fused_dequant_attention
dusterbloom via adaptive-chunked-prefill