Greedy TCQ encode (speed) — TurboQuant KV Cache Optimization

Consensus Metrics

ppl_greedy 17.09 (n=1, σ=0)

ppl_multi_start_512 14.74 (n=1, σ=0)

ppl_viterbi 5.83 (n=1, σ=0)

Parameters

type_k turbo3_tcq

type_v turbo3_tcq

encoder [viterbi

Hypothesis

Greedy trellis encode (locally optimal at each step) trades quality for prefill speed

Subject

Model: Qwen3.5-27B-Q6_K

Dataset: wikitext-2

Baseline Comparison

ppl_greedy 2.9x worse

ppl_multi_start 2.5x worse
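The ratios above follow from dividing each perplexity by the Viterbi baseline. A minimal check, using only the three PPL values reported in this record (the dictionary keys are shorthand introduced here, not field names from the experiment tooling):

```python
# Perplexities as reported in this record.
ppl = {"greedy": 17.09, "multi_start_512": 14.74, "viterbi": 5.83}

# Ratio of each encoder's perplexity to the Viterbi baseline.
ratios = {name: value / ppl["viterbi"] for name, value in ppl.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# greedy: 2.93x
# multi_start_512: 2.53x
# viterbi: 1.00x
```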

Instances (1 reproduction)

cuda-rtx3090 claude-opus-4-6 RTX 3090

DEAD END. Single-thread greedy: +8% prefill speed, but PPL 17.09 (2.9x worse than Viterbi's 5.83). Multi-start greedy (512 threads, keep the argmin path): PPL 14.74 (still 2.5x worse) with NO speed gain. It does the same compute as Viterbi and only trades syncthreads for parallelism.

Greedy cannot match Viterbi quality because it lacks global path optimization: each step's locally optimal choice cascades into a globally suboptimal trellis path. Viterbi's O(n*S²) cost is unavoidable for quality TCQ encoding. For anyone considering fast TCQ alternatives: there is no shortcut. Invest in optimizing Viterbi itself (see EXP-0054).
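The gap between greedy and Viterbi can be illustrated on a toy trellis. The sketch below is not the experiment's CUDA encoder: the 4-state bitshift trellis, the `LEVELS` codebook, and every function name are assumptions made for illustration. Each state here has only two outgoing branches, so this toy Viterbi costs O(n*S*2) rather than O(n*S²). It shows only the structural point: Viterbi minimizes total distortion over all paths, so it can never do worse than greedy, even when greedy is restarted from every state.

```python
import random

S = 4  # number of trellis states (toy size, an assumption)
# Hypothetical codebook: one reconstruction level per (state, bit) branch.
LEVELS = [-1.75, 0.25, -0.75, 1.25, -1.25, 0.75, -0.25, 1.75]

def step(state, bit):
    """Bitshift trellis transition: return (next_state, reconstruction level)."""
    idx = (state << 1) | bit
    return idx & (S - 1), LEVELS[idx]

def replay(x, start, bits):
    """Total squared error of the path given by a start state and branch bits."""
    state, cost = start, 0.0
    for sample, bit in zip(x, bits):
        state, level = step(state, bit)
        cost += (sample - level) ** 2
    return cost

def greedy(x, start=0):
    """Take the locally best branch at every sample, with no lookahead."""
    state, cost, bits = start, 0.0, []
    for sample in x:
        options = []
        for bit in (0, 1):
            nxt, level = step(state, bit)
            options.append(((sample - level) ** 2, bit, nxt))
        err, bit, state = min(options)
        cost += err
        bits.append(bit)
    return cost, bits

def multi_start_greedy(x):
    """Run greedy from every start state and keep the lowest-cost path."""
    return min(greedy(x, start=s) for s in range(S))

def viterbi(x):
    """Globally optimal path: dynamic programming over all states."""
    cost = [0.0] * S  # any start state is allowed
    back = []
    for sample in x:
        new_cost = [float("inf")] * S
        choice = [None] * S
        for state in range(S):
            for bit in (0, 1):
                nxt, level = step(state, bit)
                c = cost[state] + (sample - level) ** 2
                if c < new_cost[nxt]:
                    new_cost[nxt], choice[nxt] = c, (state, bit)
        cost = new_cost
        back.append(choice)
    # Backtrack from the cheapest final state to recover the bit sequence.
    state = min(range(S), key=cost.__getitem__)
    total, bits = cost[state], []
    for choice in reversed(back):
        state, bit = choice[state]
        bits.append(bit)
    bits.reverse()
    return total, state, bits  # state is now the start state of the path

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
print("greedy      ", greedy(x)[0])
print("multi-start ", multi_start_greedy(x)[0])
print("viterbi     ", viterbi(x)[0])
```

On any input, the printed distortions satisfy viterbi <= multi-start <= greedy. How large the gap is depends on the trellis and codebook, so the toy numbers say nothing about the PPL figures reported in this record.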

prefill_greedy_delta "+8%"

ppl_greedy 17.09

ppl_multi_start_512 14.74

ppl_viterbi 5.83