FlashInfer-style tiled TBQ3 attention kernel

proposed high priority TODO-006


Description

EXP-0009 showed that the serial KV loop in the fused kernel causes a 35-43x slowdown relative to a tensor-core baseline. A properly tiled approach, which processes Bc > 1 KV tokens per tile and runs warp-level MMA on the dequantized tiles, should narrow the gap to 2-5x by exploiting tensor-core parallelism while still avoiding full materialization of the dequantized KV.
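The tiling idea can be illustrated with a small CPU reference, written here in plain Python rather than CUDA. It is only a sketch under stated assumptions: `quantize` and `dequantize` are a hypothetical per-vector affine 3-bit scheme standing in for the real TBQ3 format (whose layout is not described here), a single query row is used (so `tile_q` is not modeled), and the per-tile dot products stand in for what warp-level MMA would compute on the GPU. What it does show is the online-softmax rescaling that lets a kernel dequantize only Bc KV tokens at a time instead of the full KV.

```python
import math
import random

BITS = 3                    # matches the suggested "bits" parameter
LEVELS = (1 << BITS) - 1    # largest 3-bit code (7)


def quantize(vec):
    """Hypothetical per-vector affine 3-bit quantizer (NOT the real TBQ3 format)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / LEVELS or 1.0   # avoid a zero scale for constant vectors
    codes = [round((x - lo) / scale) for x in vec]
    return codes, scale, lo


def dequantize(packed):
    """Inverse of the hypothetical quantizer above."""
    codes, scale, lo = packed
    return [c * scale + lo for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, keys, values):
    """Untiled softmax(q K^T / sqrt(d)) V over fully dequantized K and V."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * inv_sqrt_d for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim_v)]


def tiled_attention(q, q_keys, q_values, tile_kv=16):
    """Attention over quantized KV, dequantizing only tile_kv tokens at a time."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    dim_v = len(q_values[0][0])
    running_max = -math.inf     # max score seen so far
    running_sum = 0.0           # softmax denominator, relative to running_max
    acc = [0.0] * dim_v         # unnormalized output, relative to running_max

    for start in range(0, len(q_keys), tile_kv):
        # Only this tile is ever held in dequantized form.
        k_tile = [dequantize(p) for p in q_keys[start:start + tile_kv]]
        v_tile = [dequantize(p) for p in q_values[start:start + tile_kv]]

        # On a GPU this tile-level product is where MMA would be used.
        scores = [dot(q, k) * inv_sqrt_d for k in k_tile]

        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # 0.0 on the first tile
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]


random.seed(0)
head_dim, num_tokens = 8, 50    # 50 is not a multiple of 16, so the last tile is ragged
q = [random.gauss(0, 1) for _ in range(head_dim)]
q_keys = [quantize([random.gauss(0, 1) for _ in range(head_dim)]) for _ in range(num_tokens)]
q_values = [quantize([random.gauss(0, 1) for _ in range(head_dim)]) for _ in range(num_tokens)]

ref = reference_attention(q, [dequantize(p) for p in q_keys], [dequantize(p) for p in q_values])
out = tiled_attention(q, q_keys, q_values, tile_kv=16)
max_err = max(abs(a - b) for a, b in zip(ref, out))
print("max abs difference vs untiled reference:", max_err)
```

The tiled result should agree with the untiled reference up to floating-point rounding, which is the property a real kernel would need to preserve when moving from a serial KV loop to tiles.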

Reference

EXP-0009

Suggested Parameters

approach tiled_fused_attention

tile_kv [16

tile_q [16

use_mma true

bits 3
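As a rough feel for what these parameters imply, the following sketch compares the size of one KV tile in packed 3-bit form with the same tile dequantized to 16-bit floats. The tile size of 16 is the first value visible in the list above; `HEAD_DIM = 128` is purely an assumption for illustration, and the calculation ignores any scale or zero-point metadata that the real TBQ3 format may carry.

```python
def tile_bytes(tile_kv, head_dim, bits):
    """Bytes needed for one K (or V) tile at `bits` bits per element, rounded up."""
    total_bits = tile_kv * head_dim * bits
    return (total_bits + 7) // 8


HEAD_DIM = 128  # assumption: not specified in the proposal

packed_bytes = tile_bytes(tile_kv=16, head_dim=HEAD_DIM, bits=3)    # 768
fp16_bytes = tile_bytes(tile_kv=16, head_dim=HEAD_DIM, bits=16)     # 4096

print("packed 3-bit tile:", packed_bytes, "bytes")
print("dequantized fp16 tile:", fp16_bytes, "bytes")
```

Under these assumptions only the small per-tile buffer is ever held in dequantized form, while the rest of the KV stays packed.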

Provenance

Proposed by @dusterbloom via adaptive-chunked-prefill claude-opus-4-6