Custom tiled matmul reading TBQ3 natively

proposed · high priority · TODO-008
Description

Instead of dequantize-then-MMA, a custom WMMA kernel could read TBQ3 packed data directly and apply the inverse SRHT inside the tile accumulator, potentially reaching near-native f16 throughput. The key is to amortize the 7-stage butterfly (7 = log2 128, consistent with tile_k = 128) over a full Bc tile rather than paying for it per token.
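The amortization argument rests on linearity: a weighted sum of inverse-transformed rows equals the inverse transform of the weighted sum. The following is a minimal CPU sketch of that identity in plain Python, not the proposed kernel. It assumes the SRHT is a random sign flip followed by a normalized Walsh-Hadamard transform, skips TBQ3 unpacking entirely (rows are treated as already-decoded floats), and uses a hypothetical Bc of 16; the names `fwht`, `forward_srht`, `inverse_srht`, and `weights` are illustrative only.

```python
import random


def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform.

    Returns (transformed list, number of butterfly stages).
    len(vec) must be a power of two.
    """
    out = list(vec)
    n = len(out)
    stages = 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
        stages += 1
    return out, stages


def forward_srht(x, signs):
    """Assumed forward transform: y = H(D x) / sqrt(n)."""
    n = len(x)
    transformed, _ = fwht([s * v for s, v in zip(signs, x)])
    scale = n ** -0.5
    return [t * scale for t in transformed]


def inverse_srht(y, signs):
    """Inverse of forward_srht: x = D (H y) / sqrt(n), since H H = n I and D D = I."""
    n = len(y)
    transformed, _ = fwht(y)
    scale = n ** -0.5
    return [s * t * scale for s, t in zip(signs, transformed)]


rng = random.Random(0)
K = 128   # matches tile_k from the suggested parameters
BC = 16   # hypothetical Bc tile size, chosen only for illustration

signs = [rng.choice((-1.0, 1.0)) for _ in range(K)]
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(BC)]
weights = [rng.random() for _ in range(BC)]  # stand-in for per-token weights

# Rows as they would be stored after the forward transform (quantization omitted).
stored = [forward_srht(t, signs) for t in tokens]

# Per-token path: one inverse transform per row, BC butterflies in total.
per_token = [0.0] * K
for w, row in zip(weights, stored):
    restored = inverse_srht(row, signs)
    for i in range(K):
        per_token[i] += w * restored[i]

# Per-tile path: accumulate in the transformed domain, then one butterfly.
acc = [0.0] * K
for w, row in zip(weights, stored):
    for i in range(K):
        acc[i] += w * row[i]
per_tile = inverse_srht(acc, signs)

max_diff = max(abs(a - b) for a, b in zip(per_token, per_tile))
_, STAGES = fwht([0.0] * K)
round_trip_err = max(
    abs(a - b) for a, b in zip(tokens[0], inverse_srht(stored[0], signs))
)
```

Under these assumptions the per-tile path runs the butterfly once instead of BC times and agrees with the per-token path up to floating-point rounding. Whether the same saving carries over to a WMMA kernel depends on how the butterfly maps onto fragment registers, which this sketch does not model.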

Reference

EXP-0009

Suggested Parameters
approach: native_tbq3_matmul
tile_m: 16
tile_n: 16
tile_k: 128
bits: 3
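A few quantities follow directly from these parameters and can be sanity-checked with a small Python sketch. The `TileParams` class and its method names are hypothetical. The footprint figures assume a tile_n x tile_k operand tile packed at exactly 3 bits per element with no scale or metadata bytes, which the real TBQ3 layout may not match, and the k-step count assumes the 16x16x16 half-precision WMMA fragment shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileParams:
    """Suggested parameters from this card; derived values are illustrative."""
    approach: str = "native_tbq3_matmul"
    tile_m: int = 16
    tile_n: int = 16
    tile_k: int = 128
    bits: int = 3

    def butterfly_stages(self) -> int:
        # log2(tile_k) for a power-of-two tile_k.
        return self.tile_k.bit_length() - 1

    def packed_tile_bytes(self) -> int:
        # Assumption: dense bit packing, no per-block scales or padding.
        return self.tile_n * self.tile_k * self.bits // 8

    def f16_tile_bytes(self) -> int:
        # Same tile stored as 2-byte half-precision values.
        return self.tile_n * self.tile_k * 2

    def wmma_k_steps(self, fragment_k: int = 16) -> int:
        # Assumption: 16x16x16 fragments, so tile_k is walked in steps of 16.
        return self.tile_k // fragment_k


params = TileParams()
stages = params.butterfly_stages()
packed = params.packed_tile_bytes()
dense = params.f16_tile_bytes()
k_steps = params.wmma_k_steps()
```

With the values above this gives 7 butterfly stages, 768 packed bytes against 4096 bytes in f16 for one operand tile, and 8 fragment steps along k.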
Provenance
Proposed by @dusterbloom via adaptive-chunked-prefill claude-opus-4-6