PPL-optimal alpha is not necessarily KLD-optimal: the two metrics can disagree on the best operating point
MAJOR FINDING. PPL and KLD optimize in opposite directions above alpha ≈ 1.04. Alpha = 1.20 gives the best PPL but the worst KLD (0.112, worse than no alpha scaling at all), while alpha = 1.04 gives the best KLD (0.053) with only a modest PPL improvement. The mechanism: high alpha inflates V norms, which makes the model more "confident" (lower-entropy output distributions). This improves PPL, since correct tokens receive higher probability, but it distorts the full output distribution away from the f16 reference, raising KLD. K scaling never helps KLD, regardless of the alpha value. KEY INSIGHT: PPL-based optimization of KV cache parameters is unreliable; KLD is the correct metric for evaluating distributional fidelity.
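A minimal sketch of the mechanism, not the actual evaluation pipeline: it models alpha's confidence-inflating effect as a simple logit scale and assumes the f16 reference's top token is the "correct" one, so sharpening always helps PPL. All names (`softmax`, `ppl`, `kld`, the toy logits) are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ppl(probs, targets):
    # PPL only looks at the probability assigned to the correct token.
    p_correct = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.log(p_correct).mean()))

def kld(p_ref, q):
    # KLD measures fidelity of the *full* distribution vs. the reference.
    return float((p_ref * (np.log(p_ref) - np.log(q))).sum(axis=-1).mean())

rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(512, 64))   # toy stand-in for f16 reference logits
targets = ref_logits.argmax(axis=-1)      # ASSUMPTION: reference top token = correct token
p_ref = softmax(ref_logits)

for scale in (1.00, 1.04, 1.20):          # crude proxy for alpha's effect on confidence
    q = softmax(ref_logits * scale)       # scale > 1 sharpens (lowers entropy of) the output
    print(f"scale={scale:.2f}  PPL={ppl(q, targets):.4f}  KLD={kld(p_ref, q):.4f}")
```

In this toy, PPL improves monotonically as the scale grows while KLD only gets worse (it is zero at scale 1.00 by construction), mirroring the alpha = 1.20 vs. alpha = 1.04 split seen in the real runs: a confidence boost can pay off on the single correct token while degrading the distribution as a whole.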