Chunked cuBLAS GEMM for Q*K and score*V during prefill outperforms fused flash attention MMA
Fused flash attention MMA beats cuBLAS GEMM for attention because it avoids materializing the O(n_q × n_kv) score matrix in global memory. cuBLAS computes Q*K^T as an explicit GEMM (writing an n_q × n_kv matrix to HBM), reads it back for the softmax, writes the softmax output, then reads it back again for the score*V GEMM: four full trips through HBM for the score matrix. Flash attention fuses all three operations, keeping the scores resident in shared memory and registers. For attention specifically, the memory traffic saved by fusion is too large for cuBLAS to overcome with raw GEMM throughput. LESSON: do not replace fused flash attention with separate cuBLAS GEMMs, even though cuBLAS is well-optimized for general matrix multiplication.
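For concreteness, here is a minimal sketch of the unfused path this lesson rules out: two cuBLAS GEMMs with a softmax kernel between them, assuming row-major FP32 tensors and a single head. The function name `unfused_attention` and the naive one-thread-per-row softmax are illustrative, not taken from the original experiment; the point is where the score matrix S crosses HBM.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <math.h>

// Illustrative row-wise softmax: one thread per row (not tuned).
// Trip 2: reads S from HBM; trip 3: writes it back.
__global__ void row_softmax(float* S, int n_rows, int n_cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float* r = S + (size_t)row * n_cols;
    float m = -INFINITY;  // row max, for numerical stability
    for (int j = 0; j < n_cols; ++j) m = fmaxf(m, r[j]);
    float sum = 0.0f;
    for (int j = 0; j < n_cols; ++j) { r[j] = expf(r[j] - m); sum += r[j]; }
    float inv = 1.0f / sum;
    for (int j = 0; j < n_cols; ++j) r[j] *= inv;
}

// Q: n_q x d, K: n_kv x d, V: n_kv x d, O: n_q x d, S: n_q x n_kv scratch,
// all row-major in device memory. cuBLAS is column-major, so each row-major
// product C = A*B is expressed as the column-major product C^T = B^T * A^T.
void unfused_attention(cublasHandle_t handle,
                       const float* Q, const float* K, const float* V,
                       float* S, float* O, int n_q, int n_kv, int d) {
    const float scale = 1.0f / sqrtf((float)d), one = 1.0f, zero = 0.0f;

    // GEMM 1: S = scale * Q * K^T.
    // Trip 1: writes the full n_q x n_kv score matrix to HBM.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n_kv, n_q, d,
                &scale, K, d, Q, d, &zero, S, n_kv);

    // Softmax over each row of S (trips 2 and 3 through HBM).
    int threads = 256;
    row_softmax<<<(n_q + threads - 1) / threads, threads>>>(S, n_q, n_kv);

    // GEMM 2: O = softmax(S) * V.
    // Trip 4: reads the full score matrix back from HBM.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                d, n_q, n_kv,
                &one, V, d, S, n_kv, &zero, O, d);
}
```

For example, at n_q = n_kv = 4096, S alone is 4096 × 4096 × 4 bytes = 64 MiB, so four trips cost roughly 256 MiB of HBM traffic per head per layer. The fused kernel instead tiles K and V through shared memory with an online softmax and never materializes S, which is why no amount of raw cuBLAS throughput closes the gap.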