
TheTom/llama-cpp-turboquant

https://github.com/TheTom/llama-cpp-turboquant/issues/3 ↗
Activity Summary
Consensus Experiments (1)

Project:     TurboQuant KV Cache Optimization
Experiment:  dan-and Madreag CUDA fork (4x RTX 3080)
Claim:       Madreag CUDA fork maintains decode performance at long context
Result:      negative
Confidence:  0.14
Repro:       1/5
All Completed Experiments (1)

Project:     TurboQuant KV Cache Optimization
Fork:        apple-silicon-baselines (dan-and)
Experiment:  dan-and Madreag CUDA fork (4x RTX 3080)
Result:      negative
Date:        2026-03-27
Notes:       Decode throughput falls off a cliff at long context (0.19x at 204K tokens). Same dequantization bottleneck as on Metal, but far worse because the Madreag CUDA kernel is unoptimized. Prefill is fine (~1.0x). KV compression is only 2.17x because 30 of Qwen3.5-35B-A3B's 40 layers use linear attention and do not compress.
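A low overall compression ratio is what you would expect when only the full-attention layers' KV caches are quantized while the linear-attention layers' state is left untouched. A minimal sketch of that arithmetic, assuming a hypothetical 4x per-layer quantization ratio and illustrative per-layer byte figures (the layer counts come from the report above; nothing here is a measured value):

```python
# Sketch: overall KV-footprint compression for a mixed-attention model.
# Assumptions (not from the source): full-attention layers carry a large
# context-dependent KV cache, linear-attention layers a fixed-size state,
# and only the full-attention caches are quantized.

def overall_compression(full_layers, linear_layers, full_kv_bytes,
                        linear_state_bytes, quant_ratio):
    """Whole-model compression when only full-attention KV is quantized."""
    before = full_layers * full_kv_bytes + linear_layers * linear_state_bytes
    after = (full_layers * full_kv_bytes / quant_ratio
             + linear_layers * linear_state_bytes)
    return before / after

# Illustrative: 10 full-attention layers, 30 linear-attention layers,
# 4x quantization on the compressible caches.
ratio = overall_compression(full_layers=10, linear_layers=30,
                            full_kv_bytes=8.0, linear_state_bytes=1.0,
                            quant_ratio=4.0)
print(round(ratio, 2))  # -> 2.2
```

Even an aggressive 4x per-layer quantization yields only about 2.2x overall under these assumptions, which is why the uncompressible linear-attention layers cap the achievable ratio near the reported 2.17x.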
Projects Tracking This Resource
No projects are tracking this resource.