Moving the backtrace from shared to global memory and double-buffering the cost arrays reduce __syncthreads calls in Viterbi encode
A modest but free win. Moving the 32KB backtrace from shared to global memory relieves shared-memory pressure (35KB→5KB per block), and double-buffering the cost arrays eliminates two thirds of the __syncthreads calls (384→128 per group). The backtrace is now byte-packed instead of nibble-packed, which removes the even/odd thread synchronization that nibble packing needed. PPL is bit-exact: the optimization affects only encode speed, not quantization decisions. TCQ overhead relative to non-TCQ turbo3 drops from 4.6% to 4.1%. The fundamental limit remains: with only 4-8 thread blocks on 82 SMs, roughly 90-95% of the GPU sits idle during Viterbi, and no per-block optimization can overcome that serialization.
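
The figures and techniques above can be sanity-checked with a small host-side model. This is a minimal Python sketch, not the CUDA kernel: all function names are hypothetical, the 128 steps per group is an inference from 384 = 3 × 128 rather than something the note states, and the idle estimate assumes at most one resident block per SM. It shows that ping-pong cost buffers give the same costs as a copy-back scheme (consistent with the bit-exact claim), and that byte packing gives every entry its own byte where nibble packing makes two entries share one.

```python
STEPS_PER_GROUP = 128  # assumption: inferred from 384 = 3 * 128, not stated in the note


def syncs_per_group(barriers_per_step, steps=STEPS_PER_GROUP):
    """Total barriers per group for a given number of barriers per trellis step."""
    return barriers_per_step * steps


def idle_fraction(active_blocks, num_sms=82):
    """Fraction of SMs without work, assuming at most one resident block per SM."""
    return 1.0 - min(active_blocks, num_sms) / num_sms


def relax_copy_back(cost, trans, steps):
    """Single cost array: compute into scratch, then copy back each step."""
    cost = list(cost)
    n = len(cost)
    for _ in range(steps):
        scratch = [min(cost[p] + trans[p][s] for p in range(n)) for s in range(n)]
        # On a GPU: barrier here so all reads of `cost` finish before the overwrite.
        cost = scratch
        # On a GPU: second barrier here so the copy is visible to the next step.
    return cost


def relax_ping_pong(cost, trans, steps):
    """Double-buffered: read one buffer, write the other, swap roles each step."""
    n = len(cost)
    bufs = [list(cost), [0] * n]
    for t in range(steps):
        src, dst = bufs[t & 1], bufs[(t + 1) & 1]
        for s in range(n):
            dst[s] = min(src[p] + trans[p][s] for p in range(n))
        # On a GPU: one barrier per step is enough, since src is never overwritten.
    return bufs[steps & 1]


def pack_nibbles(preds):
    """Nibble packing: entries 2i and 2i+1 share byte i (two writers per byte)."""
    out = bytearray((len(preds) + 1) // 2)
    for i, p in enumerate(preds):
        out[i >> 1] |= (p & 0xF) << (4 * (i & 1))
    return bytes(out)


def unpack_nibbles(buf, n):
    """Inverse of pack_nibbles for the first n entries."""
    return [(buf[i >> 1] >> (4 * (i & 1))) & 0xF for i in range(n)]


def pack_bytes(preds):
    """Byte packing: entry i owns byte i, so no two writers touch the same byte."""
    return bytes(p & 0xFF for p in preds)
```

Byte packing doubles the backtrace footprint relative to nibble packing, which is a trade-off that moving the backtrace out of shared memory makes affordable. With 4 active blocks, `idle_fraction` gives about 95% idle; with 8, about 90%.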