Chunked cuBLAS GEMM for Q*K and score*V during prefill outperforms fused flash attention MMA
Fused flash attention MMA beats cuBLAS GEMM for attention because it avoids materializing the O(n_q × n_kv) score matrix in global memory. cuBLAS computes Q*K^T as an explicit GEMM (writing an n_q × n_kv matrix to HBM), reads it back for the softmax, writes the softmax output, then reads it back again for the score*V GEMM: four full trips through HBM for the score matrix. Flash attention fuses all three operations, keeping the scores resident in shared memory and registers. For attention specifically, the memory traffic saved by fusion is too large for cuBLAS to overcome with raw GEMM throughput. LESSON: do not replace fused flash attention with separate cuBLAS GEMMs, even though cuBLAS is well-optimized for general matrix multiplication.
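For concreteness, here is a minimal sketch of the unfused path this lesson rules out: two cuBLAS GEMMs with a softmax kernel between them, assuming row-major FP32 tensors and a single head. The function name `unfused_attention` and the naive one-thread-per-row softmax are illustrative, not taken from the original experiment; the point is where the score matrix S crosses HBM.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <math.h>

// Illustrative row-wise softmax: one thread per row (not tuned).
// Trip 2: reads S from HBM; trip 3: writes it back.
__global__ void row_softmax(float* S, int n_rows, int n_cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    float* r = S + (size_t)row * n_cols;
    float m = -INFINITY;  // row max, for numerical stability
    for (int j = 0; j < n_cols; ++j) m = fmaxf(m, r[j]);
    float sum = 0.0f;
    for (int j = 0; j < n_cols; ++j) { r[j] = expf(r[j] - m); sum += r[j]; }
    float inv = 1.0f / sum;
    for (int j = 0; j < n_cols; ++j) r[j] *= inv;
}

// Q: n_q x d, K: n_kv x d, V: n_kv x d, O: n_q x d, S: n_q x n_kv scratch,
// all row-major in device memory. cuBLAS is column-major, so each row-major
// product C = A*B is expressed as the column-major product C^T = B^T * A^T.
void unfused_attention(cublasHandle_t handle,
                       const float* Q, const float* K, const float* V,
                       float* S, float* O, int n_q, int n_kv, int d) {
    const float scale = 1.0f / sqrtf((float)d), one = 1.0f, zero = 0.0f;

    // GEMM 1: S = scale * Q * K^T.
    // Trip 1: writes the full n_q x n_kv score matrix to HBM.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n_kv, n_q, d,
                &scale, K, d, Q, d, &zero, S, n_kv);

    // Softmax over each row of S (trips 2 and 3 through HBM).
    int threads = 256;
    row_softmax<<<(n_q + threads - 1) / threads, threads>>>(S, n_q, n_kv);

    // GEMM 2: O = softmax(S) * V.
    // Trip 4: reads the full score matrix back from HBM.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                d, n_q, n_kv,
                &one, V, d, S, n_kv, &zero, O, d);
}
```

For example, at n_q = n_kv = 4096, S alone is 4096 × 4096 × 4 bytes = 64 MiB, so four trips cost roughly 256 MiB of HBM traffic per head per layer. The fused kernel instead tiles K and V through shared memory with an online softmax and never materializes S, which is why no amount of raw cuBLAS throughput closes the gap.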