TurboQuant generalizes to the vLLM inference framework on multi-GPU CUDA
Savings are only 30.9% because 30 of Qwen3.5-35B-A3B's 40 layers use linear attention and cannot be compressed (an architecture limitation); the 10 full-attention layers compress 4.4x. The port keeps QJL (Algorithm 2), unlike our implementation. Its reported 4/5 on NIAH is actually 5/5: the model reformats one answer and the parser scores it as a miss. Decode throughput degrades at long context, as on Metal, but less severely, since CUDA dequantization is cheaper than the Metal LUT path.
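
For context, a back-of-envelope consistency check relating the 30.9% figure to the 4.4x per-layer ratio. The memory split between layer types is inferred from these two numbers, not measured in the port, so treat this as a sketch rather than the port's actual accounting.

```python
# Sketch only: assumes "savings" means the fraction of cache/state memory saved,
# and that the 10 full-attention layers compress at ratio r while the 30
# linear-attention layers stay uncompressed.
r = 4.4        # measured compression on full-attention layers
savings = 0.309  # reported end-to-end savings

# If compressible layers hold a fraction f of total memory, savings = f * (1 - 1/r).
f_implied = savings / (1.0 - 1.0 / r)
print(f"implied compressible share: {f_implied:.2f}")  # ~0.40

# A naive equal-per-layer split (10 of 40 layers) would predict much less.
f_naive = 10 / 40
print(f"naive equal-split prediction: {f_naive * (1.0 - 1.0 / r):.3f}")  # ~0.193
```

Under these assumptions, the 30.9% figure implies the full-attention layers account for roughly 40% of the cache/state memory, not the 25% an equal-per-layer split would suggest.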