Instead of dequantizing and then running the MMA, a custom WMMA kernel could read TBQ3 packed data directly and apply the inverse SRHT inside the tile accumulator, which could bring throughput close to native f16. The key is amortizing the 7-stage butterfly over a full Bc tile rather than paying for it per token.
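One way to read the amortization argument is through linearity: a weighted sum of inverse-rotated vectors equals the inverse rotation of the weighted sum, so the butterfly can run once on the accumulated tile instead of once per token. The sketch below checks this numerically in plain Python, not CUDA. It rests on assumptions that are not in the note: the SRHT is modeled as a random sign flip followed by an orthonormal Hadamard transform with no subsampling, the vector length is 128 (inferred from the 7 stages, since 2^7 = 128), and `BC`, `fwht`, and `inverse_rotation` are hypothetical names. Random floats stand in for dequantized TBQ3 values.

```python
import random

D_HEAD = 128  # assumed vector length: 2**7, which gives a 7-stage butterfly
BC = 16       # hypothetical tile width (tokens per tile)

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    Returns a new list plus the number of butterfly stages executed.
    """
    out = list(vec)
    n = len(out)
    h = 1
    stages = 0
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
        stages += 1
    scale = n ** -0.5
    return [x * scale for x in out], stages

def inverse_rotation(vec, signs):
    # Assumed forward rotation: y = H (D x), with D a diagonal of +/-1 signs.
    # H is symmetric and orthonormal, so the inverse is x = D (H y).
    y, stages = fwht(vec)
    return [s * v for s, v in zip(signs, y)], stages

rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(D_HEAD)]
# Stand-in for values that are dequantized but still in the rotated basis.
rotated_v = [[rng.gauss(0.0, 1.0) for _ in range(D_HEAD)] for _ in range(BC)]
weights = [rng.random() for _ in range(BC)]

# Per-token path: one inverse transform for every token in the tile.
per_token = [0.0] * D_HEAD
per_token_transforms = 0
for w, v in zip(weights, rotated_v):
    x, _ = inverse_rotation(v, signs)
    per_token_transforms += 1
    for i in range(D_HEAD):
        per_token[i] += w * x[i]

# Per-tile path: accumulate in the rotated basis, then invert once.
acc = [0.0] * D_HEAD
for w, v in zip(weights, rotated_v):
    for i in range(D_HEAD):
        acc[i] += w * v[i]
per_tile, stages = inverse_rotation(acc, signs)

max_err = max(abs(a - b) for a, b in zip(per_token, per_tile))
```

Under these assumptions the per-tile path runs one butterfly where the per-token path runs `BC` of them, and the two results agree up to floating-point rounding. Whether a real kernel keeps that saving depends on details the note does not cover, such as how the softmax rescaling and the 3-bit unpacking interact with the WMMA fragments.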
EXP-0009