turbo3 quality on head_dim=128 models

Outcome: negative · 0.14 · 1/5
Parameters
type_k: turbo3
type_v: turbo3
rotation: fwht
norm_correction: l2_preserving
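A minimal sketch of how these parameters might compose (the pipeline order, centroid values, and helper names are assumptions for illustration, not the actual turbo3 code path): flip each coordinate by a random sign, apply an orthonormal FWHT, quantize to a small centroid set, then rescale the dequantized vector so its L2 norm matches the original — the `l2_preserving` correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    # iterative fast Walsh-Hadamard transform, orthonormal scaling (preserves L2 norm)
    d = x.shape[-1]
    y = x.copy()
    h = 1
    while h < d:
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(x.shape)
        h *= 2
    return y / np.sqrt(d)

def quantize_l2_preserving(v, centroids, signs):
    rot = fwht(v * signs)                                   # rotation: fwht (with random signs)
    idx = np.argmin(np.abs(rot[..., None] - centroids), axis=-1)
    q = centroids[idx]                                      # nearest-centroid scalar quantization
    scale = np.linalg.norm(rot) / (np.linalg.norm(q) + 1e-12)
    return q * scale                                        # norm_correction: l2_preserving

d = 128
signs = rng.choice([-1.0, 1.0], size=d)
centroids = np.linspace(-2.5, 2.5, 8)  # stand-in grid, not the real Lloyd-Max codebook
v = rng.standard_normal(d)
deq = quantize_l2_preserving(v, centroids, signs)
print(np.linalg.norm(deq), np.linalg.norm(v))  # norms match after correction
```

The rescale step guarantees the dequantized vector's L2 norm equals the pre-quantization norm exactly, at the cost of a per-vector scale factor that must be stored alongside the codes.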
Hypothesis

turbo3 quality generalizes across architectures

Tags
Subject
Model: multiple
Dataset: wikitext-2
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

KEY OPEN PROBLEM. Quality is excellent at head_dim=256 (<0.3% perplexity delta) but degrades by 2-4% at head_dim=128. Root-cause analysis: an FWHT over 128 dims has only 7 butterfly stages versus 8 at 256 dims, giving weaker mixing, and the random-sign concentration bound is O(1/sqrt(d)), so halving d doubles the relative variance. The 8 Lloyd-Max centroids quantize the rotated distribution, but at d=128 the post-FWHT distribution has heavier tails, leaving more mass far from the centroids. Potential fixes: CAT alignment (TODO-002/003), SQuat projection (TODO-001), asymmetric K/V bits (TODO-005). Gemma-3 additionally had an SWA bug (missing V un-rotation in the iSWA path), fixed separately.
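The mixing argument above can be sanity-checked numerically: a length-d FWHT has log2(d) butterfly stages, and after a random-sign + FWHT rotation each output coordinate is a ±1/sqrt(d)-weighted sum of the inputs, so the excess kurtosis of a heavy-tailed input shrinks to roughly 3/d — meaning d=128 retains about twice the tail weight of d=256. A toy simulation under those assumptions (Laplace inputs stand in for outlier-prone KV activations; this is not the turbo3 implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht_stages(d):
    # number of butterfly stages in a length-d Walsh-Hadamard transform
    return int(np.log2(d))

def rotate(x, signs):
    # random-sign flip followed by an orthonormal FWHT (batched over rows)
    y = (x * signs).copy()
    d = y.shape[-1]
    h = 1
    while h < d:
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(x.shape)
        h *= 2
    return y / np.sqrt(d)

def excess_kurtosis(x):
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2 - 3.0

results = {}
for d in (128, 256):
    x = rng.laplace(size=(4000, d))          # heavy-tailed inputs (excess kurtosis ~3)
    signs = rng.choice([-1.0, 1.0], size=d)
    rot = rotate(x, signs)
    results[d] = (fwht_stages(d), excess_kurtosis(rot.ravel()))
    print(d, "stages:", results[d][0], "rotated excess kurtosis:", round(results[d][1], 4))
```

The stage counts come out to 7 and 8 as noted above, and the rotated kurtosis is near zero for both sizes but systematically larger at d=128, consistent with heavier post-FWHT tails hitting the fixed 8-centroid codebook harder.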

Results
ppl_delta_qwen35_27b_hd256      +0.2%
ppl_delta_qwen35_35b_moe_hd256  +0.3%
ppl_delta_mn_violet_12b_hd128   +2.6%
ppl_delta_qwen3_14b_hd128       +3.8%
ppl_delta_gemma3_27b_hd128      +3.3%