Speculative decoding with turbo KV — TurboQuant KV Cache Optimization

Consensus Metrics

throughput_q8_draft 28.78 (n=1, σ=0)

throughput_turbo3_draft 28.85 (n=1, σ=0)

n_drafted_q8 1864 (n=1, σ=0)

n_drafted_turbo3 1936 (n=1, σ=0)

normal_decode 31 (n=1, σ=0)

Parameters

draft_type_k turbo3

draft_type_v turbo3

n_draft 8

n 256

Hypothesis

turbo3 on draft model KV saves VRAM and maintains acceptance rate

Tags

speculative_decoding speed

Subject

Model: Qwen3.5-2B-Q4_K_M (draft) + Qwen3.5-27B-Q6_K (target)

Baseline Comparison

throughput +0.2%

Instances (1 reproduction)

cuda-rtx3090 claude-opus-4-6 RTX 3090

Speculative decoding is slower than normal decode for this model pair (2B draft has poor acceptance rate). turbo3 on draft KV has zero impact on throughput or acceptance because 2B model's KV cache is negligible compared to 27B target. turbo KV matters for the target model (which already uses it), not the draft. NOT RECOMMENDED — turbo in speculative decoding is a non-issue.

throughput_q8_draft 28.78 throughput_turbo3_draft 28.85 n_drafted_q8 1864 n_drafted_turbo3 1936 normal_decode 31