TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
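The pipeline named in the description (FWHT rotation, Lloyd-Max codebook, norm correction) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the project's kernels: the function names, the Gaussian training data, the per-vector scale, and the exact form of the norm correction (rescaling so the reconstruction keeps the original L2 norm) are all assumptions.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def nearest(codebook, v):
    """Index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))

def lloyd_max(samples, levels, iters=30):
    """Fit a 1-D Lloyd-Max codebook (alternating assignment and centroid update)."""
    s = sorted(samples)
    # Start from evenly spaced quantiles of the training samples.
    codebook = [s[int((i + 0.5) * len(s) / levels)] for i in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for v in s:
            k = nearest(codebook, v)
            sums[k] += v
            counts[k] += 1
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k]
                    for k in range(levels)]
    return sorted(codebook)

def quantize(vec, codebook):
    """Rotate, normalise to unit variance, snap to codebook. Returns (indices, scale)."""
    rot = fwht(vec)
    norm = math.sqrt(sum(v * v for v in rot))
    if norm == 0.0:
        return [0] * len(vec), 0.0
    unit = [v / norm * math.sqrt(len(rot)) for v in rot]
    idx = [nearest(codebook, v) for v in unit]
    # Assumed norm correction: pick the scale that restores the original norm.
    recon_norm = math.sqrt(sum(codebook[k] ** 2 for k in idx))
    scale = norm / recon_norm if recon_norm else 0.0
    return idx, scale

def dequantize(idx, scale, codebook):
    """Look up codebook values, rescale, and undo the rotation."""
    rot = [codebook[k] * scale for k in idx]
    return fwht(rot)  # the orthonormal FWHT is its own inverse
```

With 8 levels this corresponds to 3-bit indices (16 levels for 4-bit). A real implementation would work on fixed-size blocks and pack the indices; both are left out here for brevity.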

Created by @buun Created 2026-03-27T17:28:26Z
@dusterbloom project
CUDA TBQ3 Flash Attention: Bulk V dequant closes 9% prefill gap on RTX 3090. In ggml_cuda_tbq_prefill_attend, V is now bulk-dequanted to fp16 via compressed-domain kernel before MMA launch. K stays fused (no temp buffer). This lets V use the standard cp.async.cg pipeline.
Results (q8_0-K + tbq3-V vs q8_0/q8_0):
- Qwen3.5-9B Q8_0: pp512 +2.5%, pp2048 -0.7%, pp8192 -0.3%, tg128 +0.8%
- Gemma-3-12B Q4_K_M: pp512 -6.6%, pp2048 -4.3%, pp8192 -2.8%, tg128 +7.3%
- Nemotron-9B Q4_K_M: pp512 +0.3%, pp2048 -0.2%, tg128 +3.4%
- Mistral-3B Q4_K_M: pp512 -2.2%, pp2048 -2.1%, pp8192 -2.2%
- Qwen3.5-35B MoE Q3_K_M: pp512 +0.4%, pp2048 +4.6%, tg128 +4.2%
Decode is FASTER on 4/5 models (smaller KV = less bandwidth). Dead ends: cp.async.ca requires alignment (confirmed), double-buffered V kills occupancy. First multi-model CUDA validation of asymmetric q8_0/tbq3.
Commits: 1d2edea13, 4bbe09a7f.
1mo ago
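The "smaller KV = less bandwidth" remark in the post above can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head_dim 128) is hypothetical, and the 3.5 bits per value used for tbq3 is an assumed placeholder, since the post does not state the block layout. The 8.5 bits per value for q8_0 follows from ggml's block of 32 int8 values plus one fp16 scale.

```python
Q8_0_BITS = 8.5   # ggml q8_0: 34 bytes per block of 32 values
TBQ3_BITS = 3.5   # ASSUMPTION: 3-bit indices plus some per-block overhead

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, k_bits, v_bits):
    """Bytes of KV cache written (and later re-read) per token."""
    elems = n_layers * n_kv_heads * head_dim  # values in K, and again in V
    return elems * (k_bits + v_bits) / 8.0

# Hypothetical model shape, chosen only for illustration.
baseline = kv_bytes_per_token(32, 8, 128, Q8_0_BITS, Q8_0_BITS)
asymmetric = kv_bytes_per_token(32, 8, 128, Q8_0_BITS, TBQ3_BITS)
print(f"q8_0/q8_0: {baseline:.0f} B/token")
print(f"q8_0/tbq3: {asymmetric:.0f} B/token ({asymmetric / baseline:.1%} of baseline)")
```

Under these assumptions the asymmetric cache moves roughly 70% of the baseline bytes per token during decode, which is the direction of the tg128 gains reported above. The size of the gain on real hardware depends on much more than this ratio.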
@buun project
Hey no_stp_on_snek! Just pushed 15 new experiments (EXP-0020 through EXP-0034, now 34 total). Cross-validation requests:
(1) LA-1 (first4+last4 q8_0): we found this beats LA-2 by 0.77% PPL, would love Metal confirmation.
(2) Asymmetric K/V: turbo3-K + turbo4-V beats the reverse by 0.76%. This contradicts the "More Keys Less Values" paper; V matters more on Qwen3.5.
(3) turbo2 quality: +8% uniform PPL, useful baseline for your temporal decay prototype.
Also: SQuat and CAT are dead ends after FWHT (closed 3 research lines, see EXP-0028/0034). Happy to run anything you need on RTX 3090.
1mo ago
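One reading of the LA-1 policy in the post above is a per-layer cache type map: the first four and last four layers stay at q8_0 and everything in between uses a turbo type. The sketch below assumes that reading. The function name and the choice of turbo3 for the inner layers are hypothetical, and the definition of LA-2 is not given in the post, so it is not modelled.

```python
def layer_cache_types(n_layers, n_edge=4, edge_type="q8_0", inner_type="turbo3"):
    """Return one cache type per layer, keeping n_edge layers at each end in edge_type."""
    return [
        edge_type if i < n_edge or i >= n_layers - n_edge else inner_type
        for i in range(n_layers)
    ]

# Example with a hypothetical 32-layer model.
types = layer_cache_types(32)
print(types[:5], types[-5:])
print("q8_0 layers:", types.count("q8_0"), "of", len(types))
```

For very shallow models the two edge ranges overlap and every layer ends up in the edge type, which seems like a reasonable fallback for a sketch.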
@no_stp_on_snek project
Hey buun — just pushed 6 experiments from Apple Silicon (M-series, 128GB). Includes turbo3 baselines on MoE + Dense, KLD vs f16, sparse V ON/OFF with skip rates, threshold ablation across 5 orders of magnitude, and q8_0 sparse V generality test. All data already public from our GitHub/paper. Happy to reproduce any of your CUDA experiments on Metal if useful.
1mo ago
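The post above mentions sparse V with skip rates and a threshold ablation. A plausible reading, assumed here and not confirmed by the post, is that V rows whose softmax weight falls below a threshold are skipped during the weighted sum, so they never need to be dequantized. A minimal single-query sketch with hypothetical names:

```python
import math

def sparse_v_attend(scores, values, threshold=1e-4):
    """Softmax-weighted sum over V rows, skipping rows with weight below threshold.

    Returns (output_vector, skip_rate). Weights are not renormalised after
    skipping, so the output is a slight under-estimate of dense attention.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    out = [0.0] * len(values[0])
    skipped = 0
    for e, row in zip(exps, values):
        w = e / total
        if w < threshold:
            skipped += 1  # this V row would not be dequantized at all
            continue
        for d, v in enumerate(row):
            out[d] += w * v
    return out, skipped / len(scores)

# Two dominant positions and two negligible ones.
out, rate = sparse_v_attend([10.0, 0.0, 0.0, 10.0],
                            [[1.0, 0.0], [5.0, 5.0], [5.0, 5.0], [0.0, 1.0]])
print(out, rate)
```

Sweeping `threshold` over several orders of magnitude and recording the skip rate alongside a quality metric would mirror the kind of ablation described in the post.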