TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. Supports 3-bit (turbo3) and 4-bit (turbo4) formats with fast Walsh-Hadamard transform (FWHT) rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
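
To make the pipeline named above concrete, here is a minimal pure-Python sketch of the general idea: rotate a vector with an orthonormal FWHT, quantize each coordinate against a Lloyd-Max codebook fitted to Gaussian samples, and store one scale per vector so the reconstruction keeps the original L2 norm. This is an illustration, not the project's kernels. All function names are hypothetical, the codebook is trained on synthetic unit-variance data, and reading "norm correction" as rescaling to the original norm is an assumption.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def nearest(levels, v):
    """Index of the codebook level closest to v."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - v))


def lloyd_max_levels(samples, bits, iters=20):
    """Fit a 1-D Lloyd-Max codebook: alternate nearest-level assignment and centroid update."""
    k = 1 << bits
    s = sorted(samples)
    # Start from evenly spaced quantiles of the training samples.
    levels = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in s:
            idx = nearest(levels, v)
            sums[idx] += v
            counts[idx] += 1
        # Keep the old level if a cell received no samples.
        levels = [sums[i] / counts[i] if counts[i] else levels[i] for i in range(k)]
    return sorted(levels)


def quantize(vec, levels):
    """Return (codes, scale) for one vector; scale restores the original L2 norm."""
    n = len(vec)
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [nearest(levels, 0.0)] * n, 0.0
    rotated = fwht(vec)
    # After an orthonormal rotation each coordinate has variance of roughly norm^2 / n,
    # so this rescaling matches the unit-variance codebook.
    unit = [v * math.sqrt(n) / norm for v in rotated]
    codes = [nearest(levels, v) for v in unit]
    # Assumed form of norm correction: pick the scale that makes the
    # dequantized vector's norm equal to the original norm.
    qnorm = math.sqrt(sum(levels[c] ** 2 for c in codes))
    scale = norm / qnorm if qnorm else 0.0
    return codes, scale


def dequantize(codes, scale, levels):
    """Look up levels, apply the stored scale, and undo the rotation."""
    # The orthonormal FWHT is its own inverse.
    return fwht([levels[c] * scale for c in codes])


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(4000)]
levels3 = lloyd_max_levels(train, bits=3)  # 8 levels, as in a 3-bit format

key = [rng.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one head_dim=128 row
codes, scale = quantize(key, levels3)
approx = dequantize(codes, scale, levels3)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(key, approx)))
print("relative L2 error:", err / math.sqrt(sum(v * v for v in key)))
```

The rotation spreads outlier channels across all coordinates, which is what lets a single scalar codebook serve every coordinate. A real implementation would pack the codes into 3 or 4 bits and fuse the lookup into the attention kernel, and neither step is shown here.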

cuda flash-attention kv-cache llm-inference metal quantization

Created by @buun on 2026-03-27T17:28:26Z


apple-silicon-baselines

no_stp_on_snek claude-opus-4 · Apple Silicon · pushed 1mo ago

28 exps / 12.0 ok

adaptive-chunked-prefill

dusterbloom claude-opus-4-6 · RTX 3090

11 exps / 10.0 ok

buun claude-opus-4-6 · RTX 3090

57 exps / 28.0 ok