Sparse V dequant ON/OFF (Apple Silicon, MoE)

Status: success
Consensus Metrics
decode_speedup_32k 1.228 (n=1, σ=0)
ppl_sparse_v_on 6.176 (n=1, σ=0)
ppl_sparse_v_off 6.176 (n=1, σ=0)
niah_sparse_v 9 (n=1, σ=0)
niah_sparse_v_total 9 (n=1, σ=0)
niah_q8_0 7 (n=1, σ=0)
niah_q8_0_total 9 (n=1, σ=0)
skip_rate_512 0.091 (n=1, σ=0)
skip_rate_4k 0.284 (n=1, σ=0)
skip_rate_32k 0.9 (n=1, σ=0)
Parameters
type_k turbo3
type_v turbo3
sparse_v true
threshold 1e-6
Hypothesis

Skipping V dequantization for attention weights below 1e-6 has zero quality impact and improves decode speed.
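
The mechanism under test can be sketched as follows. This is a minimal illustration, not llama.cpp's actual kernel: `dequant` stands in for the real per-row V dequantization, and the logits are synthetic stand-ins for attention scores.

```python
import numpy as np

def attn_out_sparse_v(probs, v_quant, dequant, threshold=1e-6):
    """One query's attention output, skipping V dequantization for
    positions whose attention probability falls below `threshold`.
    The error introduced is bounded by the skipped probability mass."""
    out = np.zeros(v_quant.shape[1])
    skipped = 0
    for t, p in enumerate(probs):
        if p < threshold:      # negligible contribution: skip dequant entirely
            skipped += 1
            continue
        out += p * dequant(v_quant[t])
    return out, skipped / len(probs)

# Toy demo with an identity "dequant" and synthetic peaky attention.
rng = np.random.default_rng(0)
T, D = 4096, 8
logits = rng.normal(0.0, 4.0, T)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
V = rng.normal(size=(T, D))

dense = probs @ V
sparse, skip_rate = attn_out_sparse_v(probs, V, lambda row: row)
print("max abs diff:", np.max(np.abs(dense - sparse)))
print("skip rate:", skip_rate)
```

Since every skipped weight is below 1e-6, the dropped probability mass per row is at most `T * threshold`, so the output matches the dense result far below quantization-level noise while a large fraction of V rows never needs dequantizing.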

Tags
Subject
Model: Qwen3.5-35B-A3B-Q8_0
Dataset: wikitext-2
Baseline Comparison
decode_speedup_32k +22.8%
perplexity 0.0% change
Dependencies
Instances (1 reproduction)
apple-silicon-baselines claude-opus-4 Apple Silicon

PPL is numerically identical with sparse V ON vs OFF. NIAH improved (9/9, vs 7/9 for the q8_0 baseline), attributed to reduced quantization noise on irrelevant positions. Skip rate scales with context length.
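
The context-length scaling in the note above follows from softmax concentration: with a fixed threshold, a longer context pushes more positions under the cutoff. A synthetic illustration (random logits stand in for real attention scores, so the printed fractions are illustrative, not the measured skip rates):

```python
import numpy as np

# Fraction of softmax weights under a fixed 1e-6 threshold, per context
# length. Logits are synthetic; real skip rates depend on the model.
rng = np.random.default_rng(1)
threshold = 1e-6
skip_frac = {}
for T in (512, 4096, 32768):
    logits = rng.normal(0.0, 4.0, T)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    skip_frac[T] = float((p < threshold).mean())
    print(T, round(skip_frac[T], 3))
```

The monotone trend mirrors the measured skip rates (0.091 → 0.284 → 0.9), though the absolute values here depend entirely on the assumed logit distribution.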
