Remapping the turbo3 3-bit index to {mag_idx, sign_bit} halves register LUT pressure, improving decode speed
No measurable speedup. The decode bottleneck is memory bandwidth, not ALU or register pressure from the LUT. Halving the LUT from 8 to 4 entries saves ~1 instruction per element but has no measurable effect on throughput. q8_0 runs at 31.03 tok/s; the 3% turbo3 gap is structural memory-bandwidth overhead, not compute.
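For reference, the remapping that was tested can be sketched as below. This is a minimal illustration, not the actual kernel code. It assumes the 8-entry codebook is symmetric about zero and sorted ascending, and the magnitude values are placeholders rather than turbo3's real centroids. The names `MAGNITUDES`, `FULL_LUT`, `split_index`, `decode_full`, and `decode_split` are hypothetical.

```python
# Placeholder magnitudes (assumption): NOT the real turbo3 centroids.
MAGNITUDES = [0.25, 0.75, 1.5, 3.0]

# Assumed symmetric, ascending 8-entry codebook built from the 4 magnitudes:
# [-3.0, -1.5, -0.75, -0.25, 0.25, 0.75, 1.5, 3.0]
FULL_LUT = [-m for m in reversed(MAGNITUDES)] + MAGNITUDES


def decode_full(idx: int) -> float:
    """Baseline decode: one lookup into the 8-entry table."""
    return FULL_LUT[idx]


def split_index(idx: int) -> tuple[int, int]:
    """Remap a 3-bit index to (mag_idx, sign_bit); sign_bit == 1 means negative."""
    positive = idx >> 2          # top bit selects the positive half of the codebook
    low = idx & 0b11             # low two bits select a position within the half
    # The negative half is stored in descending magnitude, so mirror it.
    mag_idx = low if positive else low ^ 0b11
    return mag_idx, 1 - positive


def decode_split(idx: int) -> float:
    """Remapped decode: one lookup into the 4-entry table, then apply the sign."""
    mag_idx, sign_bit = split_index(idx)
    mag = MAGNITUDES[mag_idx]
    return -mag if sign_bit else mag


if __name__ == "__main__":
    for i in range(8):
        print(i, split_index(i), decode_full(i), decode_split(i))
```

Both paths return the same value for every index, so the change only trades a larger table for an extra sign operation. That is consistent with the result above: it touches per-element ALU work and leaves the bytes read per token unchanged.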