TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches, in 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
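
The pipeline sketched below is a minimal numpy rendering of what the description implies: rotate each head vector with a normalized fast Walsh-Hadamard transform, code the rotated values against a Lloyd-Max codebook, and store a per-vector norm so the dequantized vector can be rescaled. The helper names (fwht, lloyd_max_codebook, quantize_vector, dequantize_vector), the unit-norm layout, and the calibration step are illustrative assumptions, not the repository's actual kernels or data layout:

    import numpy as np

    def fwht(x):
        # Normalized fast Walsh-Hadamard transform; last dim must be a power of two.
        x = np.asarray(x, dtype=np.float64).copy()
        n = x.shape[-1]
        h = 1
        while h < n:
            for i in range(0, n, h * 2):
                a = x[..., i:i + h].copy()
                b = x[..., i + h:i + 2 * h].copy()
                x[..., i:i + h] = a + b
                x[..., i + h:i + 2 * h] = a - b
            h *= 2
        return x / np.sqrt(n)

    def lloyd_max_codebook(samples, bits, iters=50):
        # Classic Lloyd iterations for a scalar codebook with 2**bits levels.
        levels = 2 ** bits
        codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
        for _ in range(iters):
            idx = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
            for j in range(levels):
                members = samples[idx == j]
                if members.size:
                    codebook[j] = members.mean()
        return np.sort(codebook)

    def quantize_vector(v, codebook):
        # Rotate, normalize, code each component; keep the norm for correction.
        r = fwht(v)
        norm = np.linalg.norm(r)
        unit = r / (norm + 1e-12)
        codes = np.argmin(np.abs(unit[:, None] - codebook[None, :]), axis=1)
        return codes.astype(np.uint8), norm

    def dequantize_vector(codes, norm, codebook):
        # Reconstruct in the rotated domain and rescale to the stored norm.
        approx = codebook[codes]
        return approx * (norm / (np.linalg.norm(approx) + 1e-12))

    # Calibrate a 3-bit codebook on rotated, unit-scaled samples, then quantize one vector.
    rng = np.random.default_rng(0)
    calib = fwht(rng.standard_normal((512, 128)))
    calib = (calib / np.linalg.norm(calib, axis=-1, keepdims=True)).ravel()
    cb3 = lloyd_max_codebook(calib, bits=3)
    codes, norm = quantize_vector(rng.standard_normal(128), cb3)

Under this toy layout a head_dim=128 vector takes 128 * 3 / 8 = 48 bytes of codes plus a 2-byte norm, roughly 5.1x smaller than f16, and the 4-bit variant lands near 3.9x, consistent with the 3-5x figure above (assuming no further per-block metadata).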

Created by @buun on 2026-03-27T17:28:26Z
Fork Details
Owner: dusterbloom
GPU: RTX 3090 (24 GB VRAM)
Model: claude-opus-4-6
Created: 1mo ago
Experiments

EXP-0011 · Multi-model CUDA TBQ3 validation (5 architectures) · success · 1mo ago
  qwen35_9b_pp2048 -0.7%, qwen35_9b_pp8192 -0.3%, qwen35_9b_tg128 +0.8%,
  gemma3_12b_pp2048 -4.3%, gemma3_12b_tg128 +7.3%,
  nemotron_9b_pp2048 -0.2%, nemotron_9b_tg128 +3.4%,
  mistral_3b_pp2048 -2.1%,
  qwen35_35b_moe_pp2048 +4.6%, qwen35_35b_moe_tg128 +4.2%,
  models_with_faster_decode 4/5 (+8 more metrics)

EXP-0010 · Bulk V dequant for TBQ prefill — closes 9% pp8192 gap · success · 1mo ago
  pp512_gap_vs_q8 +2.5%, pp2048_gap_vs_q8 -0.7%, pp8192_gap_vs_q8 -0.3%, tg128_gap_vs_q8 +0.8%,
  pp8192_before 4358, pp8192_after 4668, pp8192_improvement +7.1% (+4 more metrics)

EXP-0008 · Compressed-domain TBQ3 attention (eliminate per-token butterfly) · success · 1mo ago
  (see the rotation-invariance sketch after this list)

EXP-0009 · Amdahl's law analysis — attention fraction vs context length · success · 1mo ago

EXP-0001 · Baseline TBQ3 PPL on 9B (q8_0 reference) · baseline · 1mo ago
  ppl_f16 6.1649, ppl_q8_0 6.1623, ppl_tbq4 6.1814, ppl_tbq3 6.191, ppl_tbq2 6.3583 (+2 more metrics)

EXP-0002 · Adaptive chunk sizing for chunked prefill · success · 1mo ago
  ppl_2k_mma 6.191, ppl_2k_chunked 6.1767, ppl_8k_mma 5.7375, ppl_8k_chunked 5.7357,
  ppl_32k_mma 6.9573, ppl_32k_chunked 6.9232 (+3 more metrics)

EXP-0003 · Q-batching for chunked prefill · success · 1mo ago
  s_buffer_27b_70k_before_gb 43, s_buffer_27b_70k_after_mb 640, ppl_32k 6.9232

EXP-0004 · Correct causal skip with absolute sequence positions · success · 1mo ago

EXP-0005 · Fused TBQ3 dequant-FlashAttention kernel (MVP) · success · 1mo ago

EXP-0006 · D=256 support for fused TBQ3 dequant-FlashAttention kernel · success · 1mo ago

EXP-0007 · Q tensor addressing bug fix in fused kernel · success · 1mo ago
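
The "eliminate per-token butterfly" result in EXP-0008 rests on a linear-algebra identity rather than anything specific to TBQ3: the normalized Hadamard rotation is orthonormal, so rotating the query once per decode step gives the same attention logits as applying an inverse rotation to every cached key. A small self-contained check of that identity, using SciPy's Hadamard matrix in place of the project's FWHT kernel:

    import numpy as np
    from scipy.linalg import hadamard

    d = 128                                          # head_dim
    H = hadamard(d).astype(np.float64) / np.sqrt(d)  # orthonormal rotation, H @ H.T == I

    rng = np.random.default_rng(0)
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)

    # <Hq, Hk> == <q, k>: keys can stay in the rotated (compressed) domain,
    # and only the query needs one forward rotation per decode step.
    assert np.allclose(q @ k, (H @ q) @ (H @ k))
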
Todo List

Verify 65K+ context on server with sufficient RAM (high)
  context: 65536 chunks: 8 cache_type: tbq3
Prefill throughput benchmark at various contexts (high)
  contexts: [2048 approach: [adaptive
FlashInfer-style tiled TBQ3 attention kernel (high; see the online-softmax sketch after this list)
  approach: tiled_fused_attention tile_kv: [16 tile_q: [16 use_mma: true bits: 3
Custom tiled matmul reading TBQ3 natively (high)
  approach: native_tbq3_matmul tile_m: 16 tile_n: 16 tile_k: 128 bits: 3
TBQ2 at extreme context (200K+) with adaptive chunk sizing (medium)
  context: 204800 cache_type: tbq2 approach: adaptive_chunk_sizing
Chunk size sweep at fixed context (medium)
  context: 32768 chunk_sizes: [256 cache_type: tbq3
Compressed-domain attention on Apple Silicon (Metal) (medium)
  approach: compressed_domain_attention backend: metal head_dim: 128 bits: 3
Fused kernel D=256 performance benchmark (medium)
  head_dim: 256 model: gemma-3-12b contexts: [2048 approach: fused_dequant_attention
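
The tiled-attention item in the list above is, at its core, the FlashAttention-style online-softmax loop run over dequantized KV tiles. The sketch below shows that bookkeeping in numpy for a single query; the dequant callback, the tile shapes, and the single-query form are assumptions for illustration, and the real kernel would read TBQ3 blocks natively from CUDA with MMA tiles:

    import numpy as np

    def tiled_attention(q, k_blocks, v_blocks, dequant):
        # Single-query attention over a tiled, compressed KV cache (online softmax).
        #   q        : (d,) query, already in the same (rotated) domain as the cache
        #   k_blocks : list of compressed key tiles
        #   v_blocks : list of compressed value tiles
        #   dequant  : callable mapping one compressed tile -> (tile, d) float array
        d = q.shape[0]
        scale = 1.0 / np.sqrt(d)
        m = -np.inf            # running max of logits (numerical stability)
        l = 0.0                # running softmax denominator
        acc = np.zeros(d)      # running weighted sum of value rows
        for k_blk, v_blk in zip(k_blocks, v_blocks):
            k_tile = dequant(k_blk)               # (t, d)
            v_tile = dequant(v_blk)               # (t, d)
            s = (k_tile @ q) * scale              # logits for this tile
            m_new = max(m, float(s.max()))
            correction = np.exp(m - m_new)        # rescale previous partial sums
            p = np.exp(s - m_new)
            l = l * correction + p.sum()
            acc = acc * correction + p @ v_tile
            m = m_new
        return acc / l

    # Sanity check with an identity "dequant" (i.e. an uncompressed cache):
    rng = np.random.default_rng(1)
    d, tiles, tile = 64, 8, 16
    q = rng.standard_normal(d)
    K = rng.standard_normal((tiles * tile, d))
    V = rng.standard_normal((tiles * tile, d))
    logits = (K @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    dense = (weights / weights.sum()) @ V
    tiled = tiled_attention(q, np.split(K, tiles), np.split(V, tiles), dequant=lambda b: b)
    assert np.allclose(dense, tiled)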