EXP-0009 showed that the fused kernel's serial KV loop makes it 35-43x slower than the tensor-core baseline. A properly tiled approach (Bc > 1 KV tokens per tile, with warp-level MMA on the dequantized tiles) should narrow the gap to 2-5x by exploiting tensor-core parallelism while still avoiding full materialization of the dequantized KV.
EXP-0009
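
The tiling idea can be sketched on the CPU as follows. This is a minimal illustration, not the kernel itself. It assumes a single query vector, symmetric per-token int8 quantization, and an online-softmax accumulator, none of which are specified in this note. All function names (`quantize_rows`, `dequantize_tile`, `tiled_attention`, `reference_attention`, `toy_matrix`) are hypothetical. Only `bc` rows of K and V are dequantized at any time. The per-tile dot products mark the spot where a GPU kernel would presumably issue warp-level MMA instructions.

```python
import math

def quantize_rows(rows):
    """Symmetric per-token int8 quantization (an assumed format)."""
    q_rows, scales = [], []
    for row in rows:
        scale = max(abs(x) for x in row) / 127.0 or 1.0  # avoid a zero scale
        q_rows.append([round(x / scale) for x in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_tile(q_rows, scales, start, stop):
    """Dequantize only tokens [start, stop): the tile, never the whole cache."""
    return [[v * scales[t] for v in q_rows[t]] for t in range(start, stop)]

def tiled_attention(q, k_q, k_s, v_q, v_s, bc):
    """Single-query attention over quantized KV, processed bc tokens per tile."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(v_q[0])
    for start in range(0, len(k_q), bc):
        stop = min(start + bc, len(k_q))  # the last tile may be ragged
        k_tile = dequantize_tile(k_q, k_s, start, stop)
        v_tile = dequantize_tile(v_q, v_s, start, stop)
        # Tile-level q.K^T: the part a GPU kernel would map to MMA.
        scores = [inv_sqrt_d * sum(a * b for a, b in zip(q, k_row))
                  for k_row in k_tile]
        # Online softmax: rescale earlier partial results to the new max.
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v_row[j] for w, v_row in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

def reference_attention(q, k_q, k_s, v_q, v_s):
    """Baseline that fully materializes the dequantized K and V."""
    n = len(k_q)
    k = dequantize_tile(k_q, k_s, 0, n)
    v = dequantize_tile(v_q, v_s, 0, n)
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    scores = [inv_sqrt_d * sum(a * b for a, b in zip(q, row)) for row in k]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * row[j] for w, row in zip(weights, v)) / total
            for j in range(len(v[0]))]

def toy_matrix(n, d, phase):
    """Deterministic n x d test data."""
    return [[math.sin(0.37 * (t * d + j) + phase) for j in range(d)]
            for t in range(n)]
```

With `bc = 1` the loop degenerates to the serial per-token pattern, while larger `bc` values give each iteration a small matrix product to work on. Both settings should agree with `reference_attention` up to floating-point rounding, so the sketch only demonstrates that tiling preserves the result. It says nothing about the speedup, which depends on the actual GPU implementation.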