| Project | Experiment | Result | Confidence | Repro |
|---|---|---|---|---|
| TurboQuant KV Cache Optimization | 0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE) | TurboQuant generalizes to vLLM inference framework on multi-GPU CUDA | success | 1/5 |
| Project | Fork | Experiment | Result | Date |
|---|---|---|---|---|
| TurboQuant KV Cache Optimization | apple-silicon-baselines (0xSero) | 0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE) | Success. Only 30.9% KV-cache savings: 30 of Qwen3.5-35B-A3B's 40 attention layers use linear attention and cannot be compressed (an architecture limitation); the 10 full-attention layers compress 4.4x. Kept QJL (Algorithm 2), unlike our implementation. His 4/5 NIAH score is actually 5/5 (the model reformats one answer; a parsing issue). Decode degrades at long context as on Metal, but less severely (CUDA dequantization is cheaper than the Metal LUT). | 2026-03-27T00:00:00Z |
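
The 30.9% figure can seem low next to 4.4x compression. A minimal sketch of the arithmetic, assuming overall savings scale linearly with the fraction of the KV cache held in compressible (full-attention) layers — the per-layer cache split is not stated in the log, so `implied_fraction` below is backed out from the reported numbers, not measured:

```python
# Why 4.4x compression on 10 of 40 layers yields only ~30.9% total savings.
# Linear-attention layers keep their state uncompressed, so total savings
# are bounded by the share of the cache that full-attention layers hold.

REPORTED_SAVINGS = 0.309   # total KV-cache savings from the log
COMPRESSION_RATIO = 4.4    # compression on the 10 full-attention layers

def total_savings(compressible_fraction: float, ratio: float) -> float:
    """Overall savings when only part of the cache can be compressed."""
    return compressible_fraction * (1.0 - 1.0 / ratio)

# Invert: what share of the cache must the full-attention layers hold
# for the reported totals to be consistent?
implied_fraction = REPORTED_SAVINGS / (1.0 - 1.0 / COMPRESSION_RATIO)
print(f"implied compressible fraction: {implied_fraction:.3f}")  # ~0.400
```

So the numbers are mutually consistent if roughly 40% of the KV cache sits in the 10 full-attention layers; the remaining 60% (linear-attention state) passes through untouched.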