Tuning split-K parallel_blocks parameter does not improve turbo decode throughput
All parallel_blocks values from 1 to 32 produce identical decode speed (within ±0.1 tok/s run-to-run noise). Attention accounts for <5% of total decode time on dense models; the FFN dominates, so even a large attention-level speedup cannot move end-to-end throughput. The 2.8% turbo3-to-q8_0 gap is structural dequantization overhead and cannot be closed by attention-level tuning. This closes the Split-K/FlashDecoding investigation line; revisit only for attention-heavy workloads (e.g. MoE at extreme context).
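The null result above can be sanity-checked with a simple spread-vs-noise test plus an Amdahl's-law bound. The sketch below uses illustrative throughput numbers (not the actual benchmark data) and a hypothetical noise floor matching the ±0.1 tok/s figure; all names and values are assumptions for illustration.

```python
# Hypothetical sweep analysis: decode throughput (tok/s) measured at each
# split-K parallel_blocks setting. Numbers are illustrative placeholders,
# not the real benchmark results.
NOISE_TOK_S = 0.1  # assumed run-to-run noise floor (matches the +-0.1 tok/s figure)

measurements = {1: 41.20, 2: 41.25, 4: 41.20, 8: 41.15, 16: 41.20, 32: 41.20}

# Is any setting distinguishable from the rest beyond noise?
spread = max(measurements.values()) - min(measurements.values())
significant = spread > 2 * NOISE_TOK_S  # must exceed +-noise in both directions
print(f"spread={spread:.2f} tok/s, significant={significant}")

# Amdahl's law: with attention at <5% of decode time, even an infinitely
# fast attention kernel caps total speedup at 1 / (1 - 0.05) ~= 1.053,
# i.e. about 5% at best -- and in practice far less.
attention_fraction = 0.05
max_speedup = 1.0 / (1.0 - attention_fraction)
print(f"max end-to-end speedup from attention tuning: {max_speedup:.3f}x")
```

This is why the sweep shows nothing: the tunable region is too small a slice of decode time for any parallel_blocks choice to register above noise.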