MSE-optimal α = dot(x,q)/||q||² (the least-squares scale for x ≈ s·q, equal to β·cos(θ)) halves per-element MSE vs L2-preserving β = ||x||/||q||
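A minimal numpy sketch of the two scales, with Gaussian noise standing in for dequantization error (the vector size and noise level are illustrative assumptions): alpha achieves the lowest MSE of any scalar scale, while beta restores the original norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)            # original vector
q = x + 0.3 * rng.standard_normal(d)  # stand-in for a dequantized vector

# MSE-optimal scale: least-squares fit of x by a scalar multiple of q
alpha = np.dot(x, q) / np.dot(q, q)
# L2-preserving scale: restores the original norm exactly
beta = np.linalg.norm(x) / np.linalg.norm(q)

def mse(a, b):
    return np.mean((a - b) ** 2)

print("MSE alpha:", mse(x, alpha * q))  # lowest achievable for any scalar scale
print("MSE beta: ", mse(x, beta * q))
# norm ratio of the corrected vector vs the original:
print("norm ratio alpha:", np.linalg.norm(alpha * q) / np.linalg.norm(x))  # = cos(theta) < 1
print("norm ratio beta: ", np.linalg.norm(beta * q) / np.linalg.norm(x))   # = 1
```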
MSE-optimal scaling lowers the norm by cos(θ) between the original and quantized vectors, shrinking the q·k logits and making the softmax more uniform (equivalent to a higher attention temperature). L2-preserving scaling maintains the intended dot-product magnitudes, which is what the attention mechanism expects. KEY INSIGHT: minimizing per-element MSE is NOT the right objective for KV cache quantization. Attention cares about preserving the relative ordering and magnitude of dot products, not element-wise fidelity. L2-preserving norm correction is the right choice.
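A sketch of the temperature effect, assuming a uniform hypothetical shrinkage cos(θ) = 0.8 across keys (in practice cos(θ) varies per cached key, so the shrinkage is non-uniform and can also perturb logit ordering):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 128, 16
query = rng.standard_normal(d)
keys = rng.standard_normal((n, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

logits = keys @ query / np.sqrt(d)
cos_theta = 0.8                       # hypothetical norm shrinkage from MSE-optimal scaling

p_orig = softmax(logits)              # L2-preserving: logit magnitudes intact
p_mse = softmax(cos_theta * logits)   # all logits shrink -> distribution flattens

print("entropy original:   ", entropy(p_orig))
print("entropy MSE-optimal:", entropy(p_mse))  # higher entropy = more uniform attention
```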