The fused kernel's end-to-end slowdown is masked (diluted) by non-attention compute (FFN, norms, embeddings). By comparing three configurations — f16 KV (zero dequant overhead), TBQ3 with the baseline kernel, and TBQ3 with the fused kernel — we can isolate the attention-only time and determine what fraction of total runtime is actually available for optimization.
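A minimal sketch of that isolation arithmetic, assuming non-attention compute is identical across the three configurations (so it cancels in the differences) and using placeholder per-token latencies rather than measured numbers:

```python
# Sketch: isolating attention-only time from end-to-end per-token latencies.
# The three timings below are hypothetical placeholders, not measured values.
# Assumption: non-attention compute (FFN, norms, embeddings) is identical
# across configurations, so it cancels when we take differences.

t_f16   = 12.0   # ms/token, f16 KV cache (no dequant overhead in attention)
t_base  = 13.5   # ms/token, TBQ3 KV cache, baseline (unfused) kernel
t_fused = 14.1   # ms/token, TBQ3 KV cache, fused kernel

# Attention-only overhead of each quantized path relative to f16 attention.
base_attn_overhead  = t_base  - t_f16   # dequant cost of the baseline path
fused_attn_overhead = t_fused - t_f16   # dequant + fusion cost
fusion_regression   = t_fused - t_base  # slowdown attributable to fusion alone

# Amdahl-style bound: even if fused TBQ3 attention matched f16 attention,
# end-to-end time could shrink by at most this fraction.
optimizable_fraction = fused_attn_overhead / t_fused

print(f"baseline attn overhead : {base_attn_overhead:.2f} ms/token")
print(f"fused attn overhead    : {fused_attn_overhead:.2f} ms/token")
print(f"fusion-only regression : {fusion_regression:.2f} ms/token")
print(f"optimizable fraction   : {optimizable_fraction:.1%} of end-to-end time")
```

With these placeholder numbers, the fusion-only regression is small in absolute terms, and the Amdahl-style fraction bounds how much of the end-to-end time any attention-side optimization can recover.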