TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches, in 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
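
For orientation, here is a minimal NumPy sketch of the general idea: rotate each K/V vector with an orthonormal Fast Walsh-Hadamard Transform to spread outliers, fit a 2^bits-level scalar codebook with Lloyd's algorithm, and rescale the reconstruction so it keeps the original L2 norm ("norm correction"). This is illustrative only; function names, codebook fitting scope, and the per-vector scale are assumptions, not the llama.cpp/CUDA kernels tracked below.

```python
import numpy as np

def fwht(x):
    """Orthonormal Fast Walsh-Hadamard Transform along the last axis.
    Length must be a power of two (e.g. head_dim = 64 or 128)."""
    x = x.astype(np.float32).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling preserves L2 norms

def lloyd_max_codebook(samples, bits=3, iters=32):
    """Fit a 2**bits-level scalar codebook with Lloyd's algorithm
    (alternate nearest-centroid assignment and centroid mean update)."""
    levels = 2 ** bits
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(levels):
            sel = samples[idx == k]
            if sel.size:
                codebook[k] = sel.mean()
    return np.sort(codebook)

def quantize_vector(v, codebook):
    """Map each element to its nearest codebook entry, then compute a
    per-vector scale so the reconstruction keeps the original L2 norm."""
    idx = np.abs(v[:, None] - codebook[None, :]).argmin(axis=1)
    recon = codebook[idx]
    scale = np.linalg.norm(v) / (np.linalg.norm(recon) + 1e-12)
    return idx, scale

def dequantize(idx, scale, codebook):
    return codebook[idx] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    head_dim = 128
    keys = rng.standard_normal((4096, head_dim)).astype(np.float32)
    rot = fwht(keys)                              # rotate before quantizing
    cb = lloyd_max_codebook(rot.ravel(), bits=3)  # turbo3-style 8-level codebook
    idx, scale = quantize_vector(rot[0], cb)
    err = np.linalg.norm(dequantize(idx, scale, cb) - rot[0]) / np.linalg.norm(rot[0])
    print(f"3-bit relative reconstruction error: {err:.3f}")
```

At decode time the stored indices and scale are dequantized and rotated back with the same FWHT (it is its own inverse up to scaling); the real kernels fuse this into the attention path rather than materializing float vectors.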

Created by @buun, 2026-03-27T17:28:26Z

36 resources tracked

other

https://github.com/HCOOOH/PatternKV
https://github.com/OpenBitSys/BitDecoding
https://github.com/goodevening13/aquakv
https://github.com/Red-Hat-AI-Innovation-Team/SQuat
https://github.com/ZunhaiSu/RotateKV
https://github.com/nicoboss/llama.cpp/tree/TurboQuant
https://github.com/spiritbuun/llama-cpp-turboquant-cuda

paper

https://github.com/42Shawn/Butterflyquant-llm
https://github.com/nicoboss/turboquant_plus
https://github.com/tonbistudio/turboquant-pytorch
https://github.com/0xSero/TurboQuant-Triton
https://github.com/ggml-org/llama.cpp/discussions/20969
https://github.com/ggml-org/llama.cpp/pull/20977