turbo3 quality on head_dim=128 models

Outcome: negative · 0.14 · 1/5
Parameters
type_k: turbo3
type_v: turbo3
rotation: fwht
norm_correction: l2_preserving
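A minimal sketch of how these parameters might compose (the pipeline order, centroid values, and helper names are assumptions for illustration, not the actual turbo3 code path): flip each coordinate by a random sign, apply an orthonormal FWHT, quantize to a small centroid set, then rescale the dequantized vector so its L2 norm matches the original — the `l2_preserving` correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    # iterative fast Walsh-Hadamard transform, orthonormal scaling (preserves L2 norm)
    d = x.shape[-1]
    y = x.copy()
    h = 1
    while h < d:
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(x.shape)
        h *= 2
    return y / np.sqrt(d)

def quantize_l2_preserving(v, centroids, signs):
    rot = fwht(v * signs)                                   # rotation: fwht (with random signs)
    idx = np.argmin(np.abs(rot[..., None] - centroids), axis=-1)
    q = centroids[idx]                                      # nearest-centroid scalar quantization
    scale = np.linalg.norm(rot) / (np.linalg.norm(q) + 1e-12)
    return q * scale                                        # norm_correction: l2_preserving

d = 128
signs = rng.choice([-1.0, 1.0], size=d)
centroids = np.linspace(-2.5, 2.5, 8)  # stand-in grid, not the real Lloyd-Max codebook
v = rng.standard_normal(d)
deq = quantize_l2_preserving(v, centroids, signs)
print(np.linalg.norm(deq), np.linalg.norm(v))  # norms match after correction
```

The rescale step guarantees the dequantized vector's L2 norm equals the pre-quantization norm exactly, at the cost of a per-vector scale factor that must be stored alongside the codes.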
Hypothesis

turbo3 quality generalizes across architectures

Tags
Subject
Model: multiple
Dataset: wikitext-2
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

KEY OPEN PROBLEM. Quality is excellent at head_dim=256 (<0.3% perplexity delta) but degrades by 2-4% at head_dim=128. Root-cause analysis: an FWHT over 128 dims has only 7 butterfly stages versus 8 at 256 dims, giving weaker mixing, and the random-sign concentration bound is O(1/sqrt(d)), so halving d doubles the relative variance. The 8 Lloyd-Max centroids quantize the rotated distribution, but at d=128 the post-FWHT distribution has heavier tails, leaving more mass far from the centroids. Potential fixes: CAT alignment (TODO-002/003), SQuat projection (TODO-001), asymmetric K/V bits (TODO-005). Gemma-3 additionally had an SWA bug (missing V un-rotation in the iSWA path), fixed separately.
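The mixing argument above can be sanity-checked numerically: a length-d FWHT has log2(d) butterfly stages, and after a random-sign + FWHT rotation each output coordinate is a ±1/sqrt(d)-weighted sum of the inputs, so the excess kurtosis of a heavy-tailed input shrinks to roughly 3/d — meaning d=128 retains about twice the tail weight of d=256. A toy simulation under those assumptions (Laplace inputs stand in for outlier-prone KV activations; this is not the turbo3 implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht_stages(d):
    # number of butterfly stages in a length-d Walsh-Hadamard transform
    return int(np.log2(d))

def rotate(x, signs):
    # random-sign flip followed by an orthonormal FWHT (batched over rows)
    y = (x * signs).copy()
    d = y.shape[-1]
    h = 1
    while h < d:
        y = y.reshape(-1, h * 2)
        a, b = y[:, :h].copy(), y[:, h:].copy()
        y[:, :h], y[:, h:] = a + b, a - b
        y = y.reshape(x.shape)
        h *= 2
    return y / np.sqrt(d)

def excess_kurtosis(x):
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2 - 3.0

results = {}
for d in (128, 256):
    x = rng.laplace(size=(4000, d))          # heavy-tailed inputs (excess kurtosis ~3)
    signs = rng.choice([-1.0, 1.0], size=d)
    rot = rotate(x, signs)
    results[d] = (fwht_stages(d), excess_kurtosis(rot.ravel()))
    print(d, "stages:", results[d][0], "rotated excess kurtosis:", round(results[d][1], 4))
```

The stage counts come out to 7 and 8 as noted above, and the rotated kurtosis is near zero for both sizes but systematically larger at d=128, consistent with heavier post-FWHT tails hitting the fixed 8-centroid codebook harder.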

Results
ppl_delta_qwen35_27b_hd256      +0.2%
ppl_delta_qwen35_35b_moe_hd256  +0.3%
ppl_delta_mn_violet_12b_hd128   +2.6%
ppl_delta_qwen3_14b_hd128       +3.8%
ppl_delta_gemma3_27b_hd128      +3.3%