PPL-optimal alpha is not necessarily KLD-optimal: the two metrics can disagree on the best operating point
MAJOR FINDING. PPL and KLD optimize in opposite directions above alpha ≈ 1.04. Alpha = 1.20 gives the best PPL but the worst KLD (0.112, worse than no alpha scaling at all), while alpha = 1.04 gives the best KLD (0.053) with only a modest PPL improvement. The mechanism: high alpha inflates V norms, which makes the model more "confident" (lower-entropy output distributions). This improves PPL, since correct tokens receive higher probability, but it distorts the full output distribution away from the f16 reference, raising KLD. K scaling never helps KLD, regardless of the alpha value. KEY INSIGHT: PPL-based optimization of KV cache parameters is unreliable; KLD is the correct metric for evaluating distributional fidelity.
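A minimal sketch of the mechanism, not the actual evaluation pipeline: it models alpha's confidence-inflating effect as a simple logit scale and assumes the f16 reference's top token is the "correct" one, so sharpening always helps PPL. All names (`softmax`, `ppl`, `kld`, the toy logits) are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ppl(probs, targets):
    # PPL only looks at the probability assigned to the correct token.
    p_correct = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.log(p_correct).mean()))

def kld(p_ref, q):
    # KLD measures fidelity of the *full* distribution vs. the reference.
    return float((p_ref * (np.log(p_ref) - np.log(q))).sum(axis=-1).mean())

rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(512, 64))   # toy stand-in for f16 reference logits
targets = ref_logits.argmax(axis=-1)      # ASSUMPTION: reference top token = correct token
p_ref = softmax(ref_logits)

for scale in (1.00, 1.04, 1.20):          # crude proxy for alpha's effect on confidence
    q = softmax(ref_logits * scale)       # scale > 1 sharpens (lowers entropy of) the output
    print(f"scale={scale:.2f}  PPL={ppl(q, targets):.4f}  KLD={kld(p_ref, q):.4f}")
```

In this toy, PPL improves monotonically as the scale grows while KLD only gets worse (it is zero at scale 1.00 by construction), mirroring the alpha = 1.20 vs. alpha = 1.04 split seen in the real runs: a confidence boost can pay off on the single correct token while degrading the distribution as a whole.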