Custom tiled matmul reading TBQ3 natively

Status: proposed. Priority: high. ID: TODO-008


Description

Instead of dequantizing to f16 and then running the MMA, a custom WMMA kernel could read TBQ3-packed data directly and apply the inverse SRHT inside the tile accumulator, which could bring throughput close to native f16. The key is to amortize the 7-stage butterfly (7 = log2(128), matching the suggested tile_k of 128) over a full Bc tile instead of running it once per token.
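The amortization argument rests on the inverse transform being linear: a weighted sum of inverse-transformed rows equals the inverse transform of the weighted sum. The sketch below checks this numerically in plain Python. It is an illustration under stated assumptions, not the proposed kernel: quantization and subsampling are left out, one sign vector is assumed to be shared by every row in the tile, and the names fwht, forward_srht, inverse_srht and BC are hypothetical.

```python
import math
import random


def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform. Returns (result, stage_count)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    stages = 0
    h = 1
    while h < n:
        # One butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
        stages += 1
    return out, stages


def forward_srht(x, signs):
    """Rotation y = H D x / sqrt(n), with D = diag(signs). No subsampling here."""
    transformed, _ = fwht([s * v for s, v in zip(signs, x)])
    scale = 1.0 / math.sqrt(len(x))
    return [t * scale for t in transformed]


def inverse_srht(y, signs):
    """Inverse rotation x = D H y / sqrt(n); valid because H H = n I and D D = I."""
    transformed, _ = fwht(y)
    scale = 1.0 / math.sqrt(len(y))
    return [s * t * scale for s, t in zip(signs, transformed)]


TILE_K = 128  # from the suggested parameters
BC = 16       # hypothetical number of rows in one tile

rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(TILE_K)]  # assumed shared per tile
rows = [[rng.uniform(-1.0, 1.0) for _ in range(TILE_K)] for _ in range(BC)]
rotated_rows = [forward_srht(r, signs) for r in rows]
weights = [rng.uniform(0.0, 1.0) for _ in range(BC)]

# Per-token path: one inverse transform per row, then accumulate (BC butterflies).
per_token = [0.0] * TILE_K
for w, r in zip(weights, rotated_rows):
    restored = inverse_srht(r, signs)
    for i in range(TILE_K):
        per_token[i] += w * restored[i]

# Per-tile path: accumulate in the rotated domain, then one inverse transform.
acc = [0.0] * TILE_K
for w, r in zip(weights, rotated_rows):
    for i in range(TILE_K):
        acc[i] += w * r[i]
per_tile = inverse_srht(acc, signs)

_, stages = fwht([0.0] * TILE_K)
max_diff = max(abs(a - b) for a, b in zip(per_token, per_tile))
print("butterfly stages:", stages)
print("max abs difference:", max_diff)
```

Under these assumptions the per-tile path runs the butterfly once where the per-token path runs it BC times. Whether a real kernel can keep the accumulator in the rotated domain depends on details of the TBQ3 format that are not given here.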

Reference

EXP-0009

Suggested Parameters

approach: native_tbq3_matmul
tile_m: 16
tile_n: 16
tile_k: 128
bits: 3
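A quick arithmetic check of what these parameters imply, as a hedged sketch. The byte counts cover only the packed 3-bit codes and assume no per-block scales or other metadata. The WMMA_K value of 16 assumes the common 16x16x16 half-precision fragment shape, which the proposal does not specify.

```python
import math

# Values copied from the suggested parameters.
params = {"tile_m": 16, "tile_n": 16, "tile_k": 128, "bits": 3}

WMMA_K = 16  # assumption: 16x16x16 half-precision WMMA fragments

# Butterfly stages needed for a Hadamard transform of length tile_k.
butterfly_stages = int(math.log2(params["tile_k"]))

# Bytes per row of tile_k values: packed codes only, versus plain f16.
packed_bytes_per_row = params["tile_k"] * params["bits"] // 8
f16_bytes_per_row = params["tile_k"] * 2

# Number of k-steps the MMA loop takes to cover one tile_k slab.
k_steps = params["tile_k"] // WMMA_K

print("butterfly stages:", butterfly_stages)
print("packed bytes per row:", packed_bytes_per_row)
print("f16 bytes per row:", f16_bytes_per_row)
print("k-steps per tile:", k_steps)
```

With these numbers a packed row is 48 bytes against 256 bytes for f16, so reading TBQ3 natively would cut the memory traffic for the quantized operand by roughly 5.3x before any metadata is counted.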

Provenance

Proposed by @dusterbloom via adaptive-chunked-prefill claude-opus-4-6