Sparse V dequant (credit: TheTom)

Status: success
Consensus Metrics
decode_tg64_8k_before 114.44 (n=1, σ=0)
decode_tg64_8k_after 126.89 (n=1, σ=0)
decode_tg64_32k 126.21 (n=1, σ=0)
ppl 5.8501 ± 0.165 (n=1, σ=0)
Parameters
type_k turbo3
type_v turbo3
sparse_v_threshold 1e-6
Hypothesis

Skip V dequantization for positions whose attention weight is negligible (exp(score - max) < 1e-6).
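Equivalently, the cutoff is a fixed gap below the running max in log space:

$$\exp(s - m) < 10^{-6} \;\Longleftrightarrow\; s < m + \ln 10^{-6} \approx m - 13.8$$

so any position scoring more than ~13.8 below the current max contributes less than one part in a million to the softmax and can be skipped.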

Subject
Model: Qwen3.5-35B-A3B-Q4_K_S
Baseline Comparison
decode_8k +10.9%
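(Sanity check against the consensus metrics: 126.89 / 114.44 ≈ 1.109, i.e. +10.9% decode throughput at 8k context.)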
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

Eliminates the context-scaling regression on MoE models, with bit-identical PPL (zero quality loss). Implementation: in the vec kernel's V accumulation loop, rescale each stored value exp(score - KQ_max_old) by exp(KQ_max_old - KQ_max_new) to recover the position's final softmax weight exp(score - KQ_max_new); if that weight is below 1e-6, skip the V dequant+accumulate for that position entirely. At long context, 90%+ of positions are skipped. Works on ALL quant types (not turbo-specific). Credit to TheTom's Metal implementation. Key insight: sparse V made the fp16 decode dequant path unnecessary; native dequant + sparse V matches fp16 dequant speed at all context lengths with zero extra memory.
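A minimal standalone CUDA sketch of the technique follows. This is not the actual llama.cpp vec kernel: the kernel name sparse_v_attn, the one-thread-per-dim launch shape, and the fp16 stand-in for a quantized V cache are all illustrative; the real kernel batches scores per tile and dequantizes genuine quant blocks.

```cuda
#include <cuda_fp16.h>
#include <math.h>

#define SPARSE_V_THRESHOLD 1e-6f  // mirrors sparse_v_threshold above

// Minimal single-query attention with online softmax over n_kv positions.
// V is fp16 here as a stand-in for a quantized type: converting a V row
// (the "dequant") is exactly the work the threshold lets us skip.
__global__ void sparse_v_attn(const float * __restrict__ q,   // [head_dim]
                              const float * __restrict__ k,   // [n_kv * head_dim]
                              const half  * __restrict__ v,   // [n_kv * head_dim]
                              float       * __restrict__ out, // [head_dim]
                              int n_kv, int head_dim, float scale) {
    const int d = threadIdx.x;    // one thread per output dimension
    if (d >= head_dim) return;

    float acc   = 0.0f;       // running weighted sum of V for dim d
    float M     = -INFINITY;  // running max of attention scores
    float denom = 0.0f;       // running softmax denominator

    for (int j = 0; j < n_kv; ++j) {
        // q·k_j (every thread redundantly recomputes it in this sketch)
        float s = 0.0f;
        for (int t = 0; t < head_dim; ++t) {
            s += q[t] * k[j*head_dim + t];
        }
        s *= scale;

        const float M_new = fmaxf(M, s);
        const float corr  = expf(M - M_new);  // rescale factor for old sums
        const float w     = expf(s - M_new);  // softmax weight of position j

        acc  *= corr;
        denom = denom*corr + w;
        M     = M_new;

        // Sparse V: a negligible weight means this V row cannot change the
        // output, so skip its dequant + accumulate entirely.
        if (w < SPARSE_V_THRESHOLD) continue;

        acc += w * __half2float(v[j*head_dim + d]);  // "dequant" + accumulate
    }

    out[d] = acc / denom;
}
```

Skipping against the running max is safe: if the max later grows, a skipped position's true weight exp(s - M_final) only shrinks further, so the output is unaffected. The same test applied after a max update, on already-stored weights, is where the exp(KQ_max_old - KQ_max_new) rescale in the note above comes from.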
