Weighting codebook training by query-norm distribution (Q²) improves downstream KLD by optimizing for attention-weighted distortion
Product-aware training weights each codebook centroid's importance by the probability that a query will activate it, via Q² weighting: the distribution of squared query norms over codebook regions. product_mono/iter080 improves KLD over the compiled-in codebook by 7.2% at 2K context, and product_mono/iter100 by 9.8% at 8K. The 2-bit setting benefits more from training than 3-bit (12.8% vs. 7.2% at 2K) because a 2-bit codebook has fewer centroids, so each centroid's placement matters more. The key insight is that not all quantization errors are equal: errors on high-attention-weight elements cost more than errors on elements the queries mostly ignore.
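A minimal sketch of the idea as weighted k-means, where each training vector's distortion is scaled by an importance weight such as its expected squared query norm. The function name `q2_weighted_kmeans` and the plain Lloyd-style update are illustrative assumptions, not the actual training pipeline:

```python
import numpy as np

def q2_weighted_kmeans(X, weights, k, iters=20, seed=0):
    """Weighted k-means: each vector's squared error is scaled by its weight.

    X       : (n, d) vectors to build the codebook from
    weights : (n,) importance weights, e.g. Q^2 estimates (squared query
              norms) for the region each vector falls in -- an assumption
              standing in for the actual Q^2 weighting scheme
    k       : number of centroids (e.g. 4 for a 2-bit codebook, 8 for 3-bit)
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (plain L2 distance;
        # the weight only affects where centroids end up, not assignment).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Weighted centroid update: high-Q^2 vectors pull harder, so
        # centroids concentrate where attention-weighted distortion matters.
        for j in range(k):
            mask = assign == j
            if mask.any():
                w = weights[mask][:, None]
                centroids[j] = (w * X[mask]).sum(axis=0) / w.sum()
    return centroids, assign
```

With uniform weights this reduces to ordinary k-means; skewed Q² weights trade extra distortion on low-attention vectors for less distortion on the vectors queries actually attend to, which is the quantity the KLD numbers above reflect.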