HYPOTHESIS: fusing activation quantization into the weight GEMM kernel eliminates the separate q8_1 quantize kernel launch and its L2 traffic.
RESULT: ALL THREE VARIANTS SLOWER.
(1) f32 FMA replacing DP4A: -17.4%. Becomes compute-bound without the integer dot product.
(2) In-register quantize + DP4A: -31.5%. SFU overhead plus redundant per-block quantization (rpb=1: each block re-quantizes the same activation row).
(3) Shared-memory cooperative quantize: -71%. Phase-1 quantize overhead (940 ns/block) exceeds phase-2 MMA work (260 ns/block).
The separate quantize_q8_1 kernel is already optimal: it quantizes once and distributes the result via L2 cache. Per-block redundant quantization always loses to quantize-once-distribute. The 2.3% kernel launch overhead between quantize and GEMM is irreducible without CUDA Graphs.
LESSON: fusing two well-optimized kernels often loses when the fusion introduces redundant work.
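The quantize-once-distribute argument can be made concrete with a toy cost model. Only the 940 ns/block (quantize) and 260 ns/block (MMA) figures come from the measurements above; the ~3 µs launch overhead and the block counts are hypothetical assumptions for illustration.

```python
# Toy cost model: fused per-block quantization vs. separate quantize-once kernel.
# Per-block timings (940 ns quantize, 260 ns MMA) are from the measurements above;
# the launch overhead and block counts below are assumed, not measured.

QUANT_NS_PER_BLOCK = 940   # phase-1 cooperative quantize (measured)
MMA_NS_PER_BLOCK = 260     # phase-2 MMA work (measured)

def fused_time_ns(blocks_sharing_row: int) -> int:
    # Fused variant: with rpb=1, every GEMM block re-quantizes the same
    # activation row, so the quantize cost is paid once per block.
    return blocks_sharing_row * (QUANT_NS_PER_BLOCK + MMA_NS_PER_BLOCK)

def separate_time_ns(blocks_sharing_row: int, launch_overhead_ns: int = 3000) -> int:
    # Separate path: quantize_q8_1 runs once, then GEMM blocks read the
    # q8_1 row via L2. launch_overhead_ns is an assumed figure.
    return QUANT_NS_PER_BLOCK + launch_overhead_ns + blocks_sharing_row * MMA_NS_PER_BLOCK

for n in (1, 4, 32):
    print(f"blocks={n:3d}  fused={fused_time_ns(n):6d} ns  separate={separate_time_ns(n):6d} ns")
```

Under these assumptions fusion only pays off when very few blocks share an activation row; as soon as many GEMM blocks consume the same row, the redundant per-block quantize term dominates and the separate kernel wins, matching the measured regressions.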