turbo3 achieves near-q8_0 quality on Apple Silicon with 4.6x compression
4.6x compression with 1.02x prefill parity. Decode degrades at long context (0.93x at 32K) due to centroid LUT bottleneck on Metal.
Validates paper's core theoretical claim on real Qwen3 KV data. Post-rotation std matches expected 1/sqrt(128) exactly (ratio 1.000). Kurtosis drops from 900 (extreme outliers) to 2.9 (near-Gaussian, where 3.0 is perfect Gaussian). This is why Lloyd-Max quantization works — the rotation makes the distribution optimal for scalar quantization.