Greedy trellis encode (locally optimal at each step) trades quality for prefill speed
DEAD END. Single-thread greedy: +8% prefill speed, but PPL 17.09 (about 2.9x worse than Viterbi's 5.83). Multi-start greedy (512 threads, keep the lowest-cost path via argmin): PPL 14.74 (still about 2.5x worse) with NO speed gain: it does the same compute as Viterbi and only swaps syncthreads barriers for independent parallel paths. Greedy cannot match Viterbi quality because it has no global path optimization; each step's locally optimal choice cascades into a globally suboptimal trellis path. Viterbi's O(n*S²) cost is unavoidable for quality TCQ encoding. For anyone considering fast TCQ alternatives: there is no shortcut. Invest in optimizing Viterbi itself (see EXP-0054).
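A minimal CPU sketch of why greedy can never beat Viterbi on the same trellis. Everything here is an assumption made for illustration, not the experiment's kernel: the shift-register transition rule, the random reconstruction levels, the two-branches-per-state layout, and all function names are hypothetical. Any greedy path is one of the paths Viterbi searches, so the Viterbi cost is a lower bound for both single-start and multi-start greedy.

```python
import random

def make_trellis(num_states, seed=0):
    # Hypothetical toy codebook: levels[s][b] is the value emitted when
    # leaving state s with bit b. Not the codebook used in the experiment.
    rng = random.Random(seed)
    return [[rng.uniform(-2.0, 2.0) for _ in range(2)] for _ in range(num_states)]

def next_state(s, b, num_states):
    # Assumed shift-register transition: shift in the new bit.
    return ((s << 1) | b) % num_states

def path_cost(x, levels, start, bits):
    # Squared error of the path described by (start, bits).
    s, cost = start, 0.0
    for v, b in zip(x, bits):
        cost += (v - levels[s][b]) ** 2
        s = next_state(s, b, len(levels))
    return cost

def greedy_encode(x, levels, start=0):
    # Pick the branch with the smallest immediate error at every step.
    s, bits = start, []
    for v in x:
        b = min((0, 1), key=lambda bit: (v - levels[s][bit]) ** 2)
        bits.append(b)
        s = next_state(s, b, len(levels))
    return path_cost(x, levels, start, bits), start, bits

def multi_start_greedy(x, levels):
    # One greedy run per start state, keep the lowest-cost path (argmin).
    runs = (greedy_encode(x, levels, s) for s in range(len(levels)))
    return min(runs, key=lambda run: run[0])

def viterbi_encode(x, levels, starts):
    # Dynamic programming over all states: keep the best survivor per state.
    num_states = len(levels)
    inf = float("inf")
    cost = [inf] * num_states
    for s in starts:
        cost[s] = 0.0
    back = []
    for v in x:
        new_cost = [inf] * num_states
        choice = [None] * num_states
        for s in range(num_states):
            if cost[s] == inf:
                continue
            for b in (0, 1):
                t = next_state(s, b, num_states)
                c = cost[s] + (v - levels[s][b]) ** 2
                if c < new_cost[t]:
                    new_cost[t] = c
                    choice[t] = (s, b)
        back.append(choice)
        cost = new_cost
    s = min(range(num_states), key=cost.__getitem__)
    total, bits = cost[s], []
    for choice in reversed(back):  # trace survivors back to the start state
        s, b = choice[s]
        bits.append(b)
    bits.reverse()
    return total, s, bits

if __name__ == "__main__":
    levels = make_trellis(16, seed=1)
    rng = random.Random(2)
    x = [rng.gauss(0.0, 1.0) for _ in range(256)]
    print("greedy     ", greedy_encode(x, levels)[0])
    print("multi-start", multi_start_greedy(x, levels)[0])
    print("viterbi    ", viterbi_encode(x, levels, range(16))[0])
```

The ordering viterbi <= multi-start <= single-start holds for any input in this toy; how large the gap is depends on the codebook and data, and this sketch says nothing about the PPL figures above. The toy has two branches per state, so its per-step work scales with 2*S rather than the S² quoted for the real encoder.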