Codebooks worse at short context become better at long context due to CLT averaging in attention
Crossover at ~8K context. Compiled-in codebooks (analytically derived from coset structure) win at short context, where quantization error on individual tokens matters. Finetuned 50-iteration codebooks win at long context: the attention output is a weighted average over many quantized values, so independent zero-mean per-token errors shrink roughly as 1/sqrt(n) by the CLT, and the better distributional properties of the trained codebooks dominate the residual error. This mirrors finite-blocklength theory: at short context (small block length), low-complexity codes outperform; at long context (large block length), trained codes approach their rate-distortion bound.
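A minimal NumPy sketch of the averaging mechanism, not of the actual codebooks: quantizer A stands in for the compiled-in coset codebook, modeled as a small per-token error plus a slight systematic bias, and quantizer B stands in for the finetuned codebook, modeled as a larger but zero-mean per-token error. "Better distributional properties" is interpreted here, as an assumption, as a smaller systematic (mean) error component, and the error magnitudes are made up and chosen only so the toy crossover lands near 8K. Averaging over n attended tokens shrinks B's error like 1/sqrt(n) while A's bias floor stays put.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # head dimension (arbitrary for the toy)

def avg_error(n_ctx, noise_std, bias, trials=32):
    """RMS norm of the error in a uniform average over n_ctx quantized values."""
    errs = np.empty(trials)
    for t in range(trials):
        # per-token quantization error = systematic bias + zero-mean noise
        noise = rng.normal(0.0, noise_std, size=(n_ctx, DIM))
        # uniform attention weights for simplicity; averaging shrinks the noise, not the bias
        err = noise.mean(axis=0) + bias
        errs[t] = np.linalg.norm(err)
    return errs.mean()

# Hypothetical error models (illustrative magnitudes only):
# A ~ compiled-in coset codebook: small per-token error, slight bias
# B ~ finetuned codebook: larger per-token error, zero mean
bias_A, std_A = 0.0016 * np.ones(DIM), 0.05
bias_B, std_B = np.zeros(DIM), 0.15

for n in (256, 2048, 8192, 65536):
    eA, eB = avg_error(n, std_A, bias_A), avg_error(n, std_B, bias_B)
    print(f"n_ctx={n:6d}  err_A={eA:.4f}  err_B={eB:.4f}  winner={'A' if eA < eB else 'B'}")
```

In this toy the crossover sits at roughly (std_B^2 - std_A^2) / bias_A^2 attended tokens, about 7800 with the numbers above; where the real crossover lands depends on the actual per-token and systematic error magnitudes of the two codebooks, and softmax-weighted attention behaves qualitatively the same as long as the attention mass is spread over many tokens.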