TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches, in 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
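
For orientation, here is a minimal NumPy sketch of the general idea: rotate each K/V vector with an orthonormal Fast Walsh-Hadamard Transform to spread outliers, fit a 2^bits-level scalar codebook with Lloyd's algorithm, and rescale the reconstruction so it keeps the original L2 norm ("norm correction"). This is illustrative only; function names, codebook fitting scope, and the per-vector scale are assumptions, not the llama.cpp/CUDA kernels tracked below.

```python
import numpy as np

def fwht(x):
    """Orthonormal Fast Walsh-Hadamard Transform along the last axis.
    Length must be a power of two (e.g. head_dim = 64 or 128)."""
    x = x.astype(np.float32).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling preserves L2 norms

def lloyd_max_codebook(samples, bits=3, iters=32):
    """Fit a 2**bits-level scalar codebook with Lloyd's algorithm
    (alternate nearest-centroid assignment and centroid mean update)."""
    levels = 2 ** bits
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            sel = samples[idx == k]
            if sel.size:
                codebook[k] = sel.mean()
    return np.sort(codebook)

def quantize_vector(v, codebook):
    """Map each element to its nearest codebook entry, then compute a
    per-vector scale so the reconstruction keeps the original L2 norm."""
    idx = np.abs(v[:, None] - codebook[None, :]).argmin(axis=1)
    recon = codebook[idx]
    scale = np.linalg.norm(v) / (np.linalg.norm(recon) + 1e-12)
    return idx, scale

def dequantize(idx, scale, codebook):
    return codebook[idx] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    head_dim = 128
    keys = rng.standard_normal((4096, head_dim)).astype(np.float32)
    rot = fwht(keys)                              # rotate before quantizing
    cb = lloyd_max_codebook(rot.ravel(), bits=3)  # turbo3-style 8-level codebook
    idx, scale = quantize_vector(rot[0], cb)
    err = np.linalg.norm(dequantize(idx, scale, cb) - rot[0]) / np.linalg.norm(rot[0])
    print(f"3-bit relative reconstruction error: {err:.3f}")
```

At decode time the stored indices and scale are dequantized and rotated back with the same FWHT (it is its own inverse up to scaling); the real kernels fuse this into the attention path rather than materializing float vectors.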

Created by @buun, 2026-03-27T17:28:26Z

36 resources tracked

other

https://github.com/HCOOOH/PatternKV
https://github.com/OpenBitSys/BitDecoding
https://github.com/goodevening13/aquakv
https://github.com/Red-Hat-AI-Innovation-Team/SQuat
https://github.com/ZunhaiSu/RotateKV
https://github.com/nicoboss/llama.cpp/tree/TurboQuant
https://github.com/spiritbuun/llama-cpp-turboquant-cuda

paper

https://github.com/42Shawn/Butterflyquant-llm
https://github.com/nicoboss/turboquant_plus
https://github.com/tonbistudio/turboquant-pytorch
https://github.com/0xSero/TurboQuant-Triton
https://github.com/ggml-org/llama.cpp/discussions/20969
https://github.com/ggml-org/llama.cpp/pull/20977