Product-aware codebook training (Q²-weighted GLA)

Parameters
type_k turbo3_tcq
type_v turbo3_tcq
codebook_training product_aware_gla
variants [product_mono_iter080, product_mono_iter100]
contexts [2048, 8192]
Hypothesis

Weighting codebook training by the query-norm distribution (Q²) improves downstream KLD by optimizing for attention-weighted distortion rather than uniform distortion.
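
As a sketch of that objective (the symbols below are illustrative, not names from this experiment): standard codebook training minimizes uniform distortion, while the Q²-weighted variant minimizes

$$\min_{\{c_j\}} \; \sum_i w_i \,\lVert x_i - c_{a(i)} \rVert^2, \qquad a(i) = \arg\min_j \lVert x_i - c_j \rVert^2,$$

where $w_i$ is estimated from the squared norms of calibration queries landing on $x_i$'s codebook region; setting $w_i = 1$ recovers the unweighted objective.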

Subject
Model: Qwen3.5-27B-Q6_K
Dataset: wikitext-2
Baseline Comparison
kld_3bit_8k: -9.8% vs compiled-in
kld_2bit_2k: -12.8% vs compiled-in
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

Product-aware training weights each codebook centroid's importance by the probability that a query will activate it, via Q² weighting: the squared-query-norm distribution over codebook regions. product_mono/iter080 beats the compiled-in codebook by 7.2% KLD at 2K context; product_mono/iter100 beats it by 9.8% at 8K. 2-bit benefits more from training than 3-bit (12.8% vs 7.2% at 2K) because a 2-bit codebook has fewer centroids, so each centroid's placement matters more. The key insight is that not all quantization errors are equal: errors on high-attention-weight elements cost more than errors on elements the model ignores.
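A minimal sketch of this idea as Q²-weighted k-means, assuming keys is an (N, d) array of calibration cache vectors and q2_weights holds per-vector squared-query-norm estimates; the function and parameter names here are illustrative, not this repo's API:

import numpy as np

def q2_weighted_codebook(keys, q2_weights, n_centroids=8, n_iters=80, seed=0):
    """Weighted k-means: centroid placement is biased toward vectors
    that queries are likely to attend to (Q^2 weights)."""
    rng = np.random.default_rng(seed)
    # Initialize by sampling calibration vectors proportionally to weight.
    probs = q2_weights / q2_weights.sum()
    init = rng.choice(len(keys), size=n_centroids, replace=False, p=probs)
    centroids = keys[init].astype(np.float64)
    for _ in range(n_iters):
        # Assign each vector to its nearest centroid (plain Euclidean).
        d2 = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Update each centroid as the Q^2-weighted mean of its members,
        # so errors on high-attention vectors dominate the placement.
        for c in range(n_centroids):
            mask = assign == c
            if mask.any():
                w = q2_weights[mask][:, None]
                centroids[c] = (w * keys[mask]).sum(axis=0) / w.sum()
    return centroids

With uniform q2_weights this reduces to plain k-means. The weighting moves centroids the most when n_centroids is small (the 2-bit case), which is consistent with the larger 2-bit gain reported above.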

kld_improvement_3bit_2k: 7.2%
kld_improvement_3bit_8k: 9.8%
kld_improvement_2bit_2k: 12.8%