FlashInfer-style tiled TBQ3 attention kernel

TODO-006 (proposed, high priority)
Description

EXP-0009 showed that the fused kernel's serial KV loop causes a 35-43x slowdown versus tensor cores. A proper tiled approach (Bc > 1 KV tokens per tile, warp-level MMA on dequantized tiles) should close the gap to 2-5x by exploiting tensor-core parallelism while still avoiding full materialization of the dequantized KV cache.
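The core of the proposed approach is FlashAttention-style online softmax over KV tiles, with each tile dequantized just before the matmul so the full-precision KV cache is never materialized. The NumPy sketch below illustrates that tiling logic only, not the CUDA kernel or the real TBQ3 packing; `quantize3`/`dequantize3` use a hypothetical symmetric per-row 3-bit scheme for illustration.

```python
import numpy as np

def quantize3(x):
    # Hypothetical symmetric per-row 3-bit quantization (stand-in for TBQ3 packing)
    scale = np.abs(x).max(axis=-1, keepdims=True) / 3.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize3(q, scale):
    return q.astype(np.float32) * scale

def tiled_attention(Q, Kq, Ks, Vq, Vs, Bc=16):
    # Online-softmax attention over KV tiles of Bc tokens; each tile is
    # dequantized on the fly, mimicking the fused tiled kernel's data flow.
    T, d = Q.shape
    N = Kq.shape[0]
    m = np.full(T, -np.inf)          # running row maxima
    l = np.zeros(T)                  # running softmax denominators
    acc = np.zeros((T, d))           # running weighted-value accumulator
    for s in range(0, N, Bc):
        K = dequantize3(Kq[s:s + Bc], Ks[s:s + Bc])
        V = dequantize3(Vq[s:s + Bc], Vs[s:s + Bc])
        S = Q @ K.T / np.sqrt(d)     # tile of attention scores
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])
        corr = np.exp(m - m_new)     # rescale old partial sums to new max
        l = l * corr + p.sum(axis=1)
        acc = acc * corr[:, None] + p @ V
        m = m_new
    return acc / l[:, None]
```

In the actual kernel, each `Q @ K.T` and `p @ V` tile product would map to warp-level MMA instructions on the dequantized tile held in registers/shared memory; the online-softmax rescaling is what lets tiles be processed independently without a second pass.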

Reference

EXP-0009

Suggested Parameters
approach tiled_fused_attention
tile_kv [16
tile_q [16
use_mma true
bits 3
Provenance
Proposed by @dusterbloom via adaptive-chunked-prefill claude-opus-4-6