Accepting ~1% PPL regression from fp16 round-trip enables 1.9x prefill speedup for turbo4
Enabled fp16 dequant + MMA for the turbo4 prefill path. The QJL correction costs ~1% PPL in the fp16 round-trip: fp16's 10-bit mantissa rounds away the ~0.001-magnitude QJL adjustments. The 1.9x prefill speedup (588 → 1113 tok/s) is accepted because only prompt tokens are affected; generated tokens still use the full-precision vec kernel with inline dequant. Real inference quality therefore lands between 5.82 and 5.90 PPL depending on the prompt/generation ratio. turbo3 prefill PPL verified unchanged at 5.8501.
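A minimal sketch of why the QJL adjustments vanish in the fp16 round-trip (illustrative values, not the actual kernel data): with 10 explicit mantissa bits, the spacing between representable fp16 numbers near 4.0 is 2^-8 ≈ 0.0039, so a ~0.001 correction is smaller than half a ULP and rounds back to the uncorrected value.

```python
import numpy as np

# fp16 has 10 explicit mantissa bits, so near 4.0 the gap between
# representable values (ULP) is 2**-8 = 0.00390625.
ulp_near_4 = float(np.spacing(np.float16(4.0)))
print(ulp_near_4)  # 0.00390625

k = np.float32(4.0)              # hypothetical dequantized value
correction = np.float32(0.001)   # ~0.001-magnitude QJL-style adjustment
corrected = k + correction       # applied in fp32: 4.001

# Round-tripping through fp16 snaps back to 4.0 -- the correction is lost,
# since 0.001 is below half the local ULP (0.00195...).
print(np.float16(corrected) == np.float16(k))  # True
```

This is why the regression only shows up where keys pass through an fp16 buffer (prefill), while the fp32 vec-kernel path keeps the correction intact.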