Encode-time vs decode-time alpha

Status: success (0.14, 1/5)
Parameters
type_k turbo3_tcq
type_v turbo3_tcq
alpha_application [encode_time, decode_time]
contexts [2048, 8192]
Hypothesis

Alpha applied at decode time (scaling dequantized V) may give different results than alpha applied at encode time (baked into the fp16 norm before quantization), with decode-time application enabling context-adaptive deployment.

Tags
Subject
Model: Qwen3.5-27B-Q6_K
Dataset: wikitext-2
Baseline Comparison
kld_8k decode-time wins by 3.9%
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

NUANCED RESULT. Encode-time alpha wins at 2K context: baking alpha into the fp16 norm before quantization means the codebook sees the correctly scaled values and can quantize them optimally. Decode-time alpha wins at 8K+ (-3.9% KLD) because it enables context-adaptive correction (alpha varies as the context grows, per EXP-0046).

CRITICAL CAVEAT: for V specifically, alpha MUST be applied at encode time (baked into the fp16 norm). Applying it only at decode time causes a 25% KLD regression, because the norm stored in the quantized block header is then wrong.

The winning strategy: apply alpha at encode time to set the correct norm, then apply a context-adaptive correction factor at decode time on top. This two-stage approach gets the best of both worlds.
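The two-stage strategy can be sketched with a toy block-quantization round trip. This is a minimal illustration, not the actual kernel: the block size, the 4-bit-style grid, and the function names are all hypothetical, and the real format presumably stores the fp16 norm in a block header alongside the packed ints.

```python
import numpy as np

BLOCK = 64  # hypothetical block size


def quantize_block(v: np.ndarray, alpha_encode: float = 1.0):
    """Encode-time alpha: scale v BEFORE computing the fp16 block norm,
    so the stored norm is correct and the int grid sees the scaled values.
    Assumes v is nonzero float32 of length BLOCK."""
    v = v * np.float32(alpha_encode)
    norm = np.float16(np.max(np.abs(v)) / 7.0)  # fp16 scale in block header
    q = np.clip(np.round(v / np.float32(norm)), -7, 7).astype(np.int8)
    return norm, q


def dequantize_block(norm: np.float16, q: np.ndarray,
                     alpha_decode: float = 1.0) -> np.ndarray:
    """Decode-time alpha: a context-adaptive correction applied on top of
    the stored norm when the block is dequantized."""
    return q.astype(np.float32) * np.float32(norm) * np.float32(alpha_decode)


# Two-stage usage: bake the main alpha in at encode time, then apply a
# small context-dependent correction at decode time (value illustrative).
rng = np.random.default_rng(0)
v = rng.standard_normal(BLOCK).astype(np.float32)
norm, q = quantize_block(v, alpha_encode=1.3)       # norm reflects alpha
v_hat = dequantize_block(norm, q, alpha_decode=1.02)  # per-context tweak
```

Applying alpha only at decode time would leave `norm` computed from the unscaled values, which is the header mismatch the caveat above describes.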

kld_encode_2k: "better"
kld_decode_8k: "better (-3.9% vs encode)"