Per-channel RMS-based scaling before L2 norm + FWHT reduces turbo3 quantization error on head_dim=128 where channels are anisotropic
Adapts the InnerQ paper (designed for integer quantization) to codebook-based turbo3. Key findings:
(1) RMS-based scaling (mode=0) works; the paper's max-based formula (mode=1) does NOT transfer (PPL 6.6716, worse than baseline). A sketch of the adapted scheme follows this list.
(2) Calibrating the scales from BOTH K and V beats K-only calibration (PPL 6.5349 vs 6.5757).
(3) Applying the scales to both K and V is empirically better than applying them to K only (PPL 6.5349 vs 6.5418).
(4) Strength sweep: 0.20 is optimal; 0.10 is too weak (PPL 6.5850) and 0.50 too strong (PPL 6.5591).
(5) Applying the inverse scale to Q in the FA kernel preserves Q·K dot products exactly (see the check below).
(6) Calibration is online, from the first 100K tokens.
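A minimal PyTorch sketch of the scheme described above, under stated assumptions: the helper names (`calibrate_rms_scales`, `apply_channel_scales`), tensor shapes, and the exact strength formula (an exponent that dampens the RMS correction toward 1.0) are illustrative, not the actual turbo3 code, and the downstream L2 norm + FWHT + codebook steps are not shown.

```python
import torch

def calibrate_rms_scales(k: torch.Tensor, v: torch.Tensor,
                         strength: float = 0.20, eps: float = 1e-6) -> torch.Tensor:
    """Per-channel scales from the RMS of BOTH K and V (mode=0; findings 1, 2).

    k, v: [num_tokens, head_dim] calibration slices (online, e.g. the first
    100K tokens per finding 6). Returns [head_dim] scales.
    """
    kv = torch.cat([k, v], dim=0)
    rms = kv.float().pow(2).mean(dim=0).add_(eps).sqrt()  # per-channel RMS
    # Raise to `strength` to dampen the correction (0.20 was the sweep optimum;
    # 0.10 too weak, 0.50 too strong) and normalize by the mean RMS so the
    # overall magnitude of K/V is roughly preserved. This formula is an
    # assumption about how "strength" enters; the note does not spell it out.
    return (rms / rms.mean()).pow(strength)

def apply_channel_scales(q, k, v, scales):
    """Flatten anisotropic channels before L2 norm + FWHT + codebook lookup.

    K and V are divided by the scales (finding 3); Q is multiplied by the same
    scales -- the inverse transform -- so the per-channel factors cancel in
    every Q.K dot product inside the attention kernel (finding 5). How V's
    scaling is undone after dequantization is outside this sketch.
    """
    return q * scales, k / scales, v / scales
```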
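A quick numerical check of finding 5, reusing the hypothetical helpers above: because Q carries the elementwise inverse of K's scaling, the attention logits are unchanged up to float rounding. In the real kernel this Q-side multiply would live inside the FA Q load; here it is done eagerly for clarity.

```python
torch.manual_seed(0)
head_dim = 128
q = torch.randn(64, head_dim)
k = torch.randn(4096, head_dim)
v = torch.randn(4096, head_dim)

scales = calibrate_rms_scales(k, v)            # online calibration slice
qs, ks, vs = apply_channel_scales(q, k, v, scales)

# Per-channel factors cancel: (q * s) . (k / s) == q . k
logits_ref = q @ k.T
logits_scaled = qs @ ks.T
assert torch.allclose(logits_ref, logits_scaled, atol=1e-4)
```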