TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
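The pipeline named above (FWHT rotation, Lloyd-Max codebook, norm correction) can be sketched in miniature. This is a hedged illustration, not the project's actual kernels: function names, the quantile initialization, and the exact placement of the per-vector norm scale are assumptions; the source only states that turbo3/turbo4 combine these three ingredients.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling makes the transform its own inverse

def lloyd_max_codebook(samples, bits, iters=25):
    """Fit a 2**bits-level scalar Lloyd-Max codebook (1-D k-means) to samples."""
    k = 2 ** bits
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = samples[idx == c]
            if members.size:
                centroids[c] = members.mean()
    return np.sort(centroids)

def quantize(v, centroids):
    """Rotate with FWHT, rescale entries toward unit variance, snap to centroids."""
    r = fwht(v)
    scale = np.linalg.norm(r) / np.sqrt(len(r))  # assumed per-vector norm correction
    u = r / scale
    codes = np.abs(u[:, None] - centroids[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize(codes, scale, centroids):
    """Look up centroids, restore the stored norm, invert the rotation."""
    return fwht(centroids[codes] * scale)
```

Storing one fp scale per vector is what makes the low-bit codes usable here: the FWHT spreads each head vector's energy evenly across dimensions, so after the norm correction the entries are approximately Gaussian and a single codebook fitted offline serves every vector.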

Created by @buun on 2026-03-27T17:28:26Z
96 experiments (46 successes, 11 failures, 1 conflict) · 3 forks · 36 resources · 2 benchmarks · 3 broadcasts
Top Experiments

| Title | Hypothesis | Result | Confidence | Repro |
|---|---|---|---|---|
| turbo4 K Q pre-rotation bug fix | turbo4 K produces garbage because the Q pre-rotation guard only checks TURBO3_0, not TURBO4_0 | success | 0.14 | 1/5 |
| TurboQuant vs rotated q4_0/q8_0 (upstream PR #21038) | turbo3 at 3.5 bpv competes with upstream rotated q4_0 at 4.5 bpv | success | 0.14 | 1/5 |
| Dequant optimization, non-vec FA kernel (FAILED) | Forcing the non-vectorized FA kernel (nl=2) improves single-token decode | failure | 0.14 | 1/5 |
| Native VEC decode (scalar dequant in attention kernel) | Reading turbo3 directly in the VEC attention kernel (scalar dequant, no fp16 buffer) saves 5x bandwidth by avoiding fp16 materialization | negative | 0.14 | 1/5 |
| Multi-sequence (n_seq > 1) dequant fix | turbo dequant-to-fp16 kernels ignore the stream dimension ne[3], causing catastrophic PPL with n_seq > 1 | success | 0.14 | 1/5 |
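The multi-sequence dequant fix can be illustrated in miniature. This is a hedged Python sketch, not the actual ggml kernel: the tensor layout (head_dim, n_tokens, n_heads, n_seq), function names, and shapes are all hypothetical; the point is only the missing outer loop over the fourth (stream) dimension ne[3].

```python
import numpy as np

def dequant_to_fp16_buggy(codes, centroids, out):
    """Buggy dequant: iterates head_dim x n_tokens x n_heads but ignores the
    stream/sequence dimension, so every sequence after the first keeps stale data."""
    d, t, h, s = codes.shape
    for k in range(h):
        for j in range(t):
            for i in range(d):
                out[i, j, k, 0] = centroids[codes[i, j, k, 0]]

def dequant_to_fp16_fixed(codes, centroids, out):
    """Fixed dequant: adds the outer loop over the stream dimension (ne[3])."""
    d, t, h, s = codes.shape
    for l in range(s):
        for k in range(h):
            for j in range(t):
                for i in range(d):
                    out[i, j, k, l] = centroids[codes[i, j, k, l]]
```

With n_seq = 1 both versions agree, which is why the bug only surfaces (as catastrophic perplexity) once a second sequence shares the cache.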