Bulk dequant turbo KV to fp16 then use MMA tensor core kernel for prefill
1.78x prefill speedup; turbo3 now reaches 98.8% of q8_0 prefill throughput. Implementation: when Q->ne[1] > 1 (prefill), allocate temporary fp16 buffers via cudaMalloc (NOT cudaMallocAsync, which causes NaN on CUDA graph replay), bulk-dequantize turbo K/V to fp16, then dispatch the standard MMA tensor core kernel. During decode (Q->ne[1] == 1), keep the existing vec kernel with inline dequant. Memory overhead is ~16MB per head group, temporary and freed after attention. IMPORTANT: enabled only for turbo3, NOT turbo4. The QJL correction loses ~1% PPL through the fp16 round-trip, because the 10-bit mantissa rounds away QJL adjustments of ~0.001 magnitude. turbo4 prefill was later enabled separately, with the quality tradeoff explicitly accepted (EXP-0016b).
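The prefill/decode dispatch above can be sketched as plain branching logic. This is an illustrative sketch, not the actual kernel-selection code; `choose_attention_path`, the string path names, and the `kv_type` tag are all hypothetical:

```python
def choose_attention_path(q_seq_len: int, kv_type: str) -> str:
    """Pick an attention kernel path (illustrative sketch, not real code).

    q_seq_len -- Q->ne[1], the number of query tokens in this call
    kv_type   -- hypothetical tag for the KV cache quantization format
    """
    if q_seq_len > 1 and kv_type == "turbo3":
        # Prefill: bulk-dequant K/V to temporary fp16 buffers, then
        # dispatch the standard MMA tensor core kernel.
        return "bulk_dequant_fp16_mma"
    # Decode (q_seq_len == 1), or turbo4, where the fp16 round-trip
    # would round away the QJL correction: vec kernel with inline dequant.
    return "vec_inline_dequant"
```

Note that turbo4 falls through to the vec path even during prefill, matching the quality constraint described above.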
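The fp16 round-trip loss cited for turbo4 can be reproduced with Python's stdlib half-float packing: with a 10-bit mantissa, the fp16 ulp near 8.0 is 2**-7 = 0.0078125, so an adjustment of ~0.001 magnitude vanishes entirely. The specific values (8.0, 0.001) are illustrative magnitudes, not taken from the actual QJL code:

```python
import struct

def fp16_roundtrip(x: float) -> float:
    """Round a float to the nearest IEEE 754 half (fp16) and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

value = 8.0         # illustrative dequantized K/V magnitude
adjustment = 0.001  # illustrative QJL-scale correction

# The fp16 ulp at 8.0 is 2**-7 = 0.0078125, several times larger than
# the adjustment, so the corrected value rounds back to the original:
survived = fp16_roundtrip(value + adjustment) != value
print("QJL adjustment survives fp16 round-trip:", survived)
```

This is why the fp16 bulk-dequant path is safe for turbo3 but degrades turbo4: turbo3 has no sub-ulp correction term to lose.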