Training codebooks to lower MSE always improves perplexity (FALSE)
IMPORTANT NEGATIVE RESULT. MSE and PPL diverge after roughly 50 GLA iterations: 200 iterations achieve a 54.7% MSE reduction but PPL 5.91 (+1.47% over baseline), while 50 iterations at a 52.8% MSE reduction give PPL 5.83 (+0.13%). Post-FWHT distributions are near-perfectly Gaussian, so the synthetic training data closely matches the real data; a synthetic-to-real gap does not explain the divergence. The divergence arises because MSE optimizes element-wise reconstruction fidelity, while attention depends on the relative ordering of dot products. Over-trained codebooks overfit to MSE at the expense of the distributional properties that matter for softmax. 50-iteration GLA from coset initialization is optimal.
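A minimal sketch of the setup described above, assuming GLA means the generalized Lloyd algorithm and that post-FWHT sub-vectors are well modeled by N(0, I). This is not the actual pipeline: the dimensions, codebook size, probe set, and the random-subset stand-in for coset initialization are all illustrative. It shows how one might monitor both MSE and a dot-product rank-correlation proxy for the attention-relevant fidelity per iteration; the actual divergence depends on the real model and quantization setup.

```python
# Hypothetical sketch: generalized Lloyd (k-means style) codebook training on
# synthetic Gaussian data, logging MSE alongside a dot-product rank-correlation
# proxy for attention fidelity. All names and sizes are illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
dim, n_train, n_codes = 8, 20_000, 256          # assumed sub-vector dim / codebook size
X = rng.standard_normal((n_train, dim))         # post-FWHT data is ~Gaussian, so
                                                # N(0, I) samples stand in for it

# Stand-in for coset initialization: a random subset of training points.
codebook = X[rng.choice(n_train, n_codes, replace=False)].copy()

def assign(V, C):
    """Nearest-codeword assignment (the partition step of GLA)."""
    d2 = (V**2).sum(1)[:, None] - 2.0 * V @ C.T + (C**2).sum(1)[None, :]
    return d2.argmin(1)

probe = X[:512]                                 # probe set for pairwise dot products
iu = np.triu_indices(len(probe), k=1)
ref_dots = (probe @ probe.T)[iu]                # reference dot-product ordering

for it in range(1, 201):
    idx = assign(X, codebook)
    for k in range(n_codes):                    # centroid update (codebook step)
        mask = idx == k
        if mask.any():
            codebook[k] = X[mask].mean(0)
    if it in (10, 50, 100, 200):
        Xq = codebook[assign(probe, codebook)]  # quantize the probe set
        mse = ((probe - Xq) ** 2).mean()
        q_dots = (Xq @ Xq.T)[iu]
        rho = spearmanr(ref_dots, q_dots).correlation
        print(f"iter {it:3d}  MSE {mse:.4f}  dot-product rank corr {rho:.4f}")
```

The point of logging both metrics is that MSE alone would always favor the 200-iteration codebook; an ordering-sensitive metric is needed to catch the regime where further MSE reduction stops helping (and starts hurting) softmax attention.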