Encode-time vs decode-time alpha

Status: success (0.14, 1/5)
Parameters
type_k turbo3_tcq
type_v turbo3_tcq
alpha_application [encode_time, decode_time]
contexts [2048, 8192]
Hypothesis

Alpha applied at decode time (scaling dequantized V) may give different results than alpha applied at encode time (baked into the fp16 norm before quantization), with decode-time application enabling context-adaptive deployment.

Tags
Subject
Model: Qwen3.5-27B-Q6_K
Dataset: wikitext-2
Baseline Comparison
kld_8k decode-time wins by 3.9%
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

NUANCED RESULT. Encode-time alpha wins at 2K context: baking alpha into the fp16 norm before quantization means the codebook sees the correctly scaled values and can quantize them optimally. Decode-time alpha wins at 8K+ (-3.9% KLD) because it enables context-adaptive correction (alpha varies as the context grows, per EXP-0046).

CRITICAL CAVEAT: for V specifically, alpha MUST be applied at encode time (baked into the fp16 norm). Applying it only at decode time causes a 25% KLD regression, because the norm stored in the quantized block header is then wrong.

The winning strategy: apply alpha at encode time to set the correct norm, then apply a context-adaptive correction factor at decode time on top. This two-stage approach gets the best of both worlds.
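The two-stage strategy can be sketched with a toy block-quantization round trip. This is a minimal illustration, not the actual kernel: the block size, the 4-bit-style grid, and the function names are all hypothetical, and the real format presumably stores the fp16 norm in a block header alongside the packed ints.

```python
import numpy as np

BLOCK = 64  # hypothetical block size


def quantize_block(v: np.ndarray, alpha_encode: float = 1.0):
    """Encode-time alpha: scale v BEFORE computing the fp16 block norm,
    so the stored norm is correct and the int grid sees the scaled values.
    Assumes v is nonzero float32 of length BLOCK."""
    v = v * np.float32(alpha_encode)
    norm = np.float16(np.max(np.abs(v)) / 7.0)  # fp16 scale in block header
    q = np.clip(np.round(v / np.float32(norm)), -7, 7).astype(np.int8)
    return norm, q


def dequantize_block(norm: np.float16, q: np.ndarray,
                     alpha_decode: float = 1.0) -> np.ndarray:
    """Decode-time alpha: a context-adaptive correction applied on top of
    the stored norm when the block is dequantized."""
    return q.astype(np.float32) * np.float32(norm) * np.float32(alpha_decode)


# Two-stage usage: bake the main alpha in at encode time, then apply a
# small context-dependent correction at decode time (value illustrative).
rng = np.random.default_rng(0)
v = rng.standard_normal(BLOCK).astype(np.float32)
norm, q = quantize_block(v, alpha_encode=1.3)       # norm reflects alpha
v_hat = dequantize_block(norm, q, alpha_decode=1.02)  # per-context tweak
```

Applying alpha only at decode time would leave `norm` computed from the unscaled values, which is the header mismatch the caveat above describes.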

kld_encode_2k: "better"
kld_decode_8k: "better (-3.9% vs encode)"