Compressed-domain TBQ3 attention (eliminate per-token butterfly)

success
0.08
1/5
Overview Experiments 96 Forks 3 Resources 36 Benchmarks 2 Broadcasts 3 Related
Hypothesis

Since the SRHT (H, diag(r)) is fixed across all tokens, we can pre-rotate Q with forward SRHT once, compute K scores directly against centroids (no K butterfly), accumulate V in compressed domain (no V butterfly), and apply ONE inverse SRHT at the end. This eliminates 14 butterfly stages per KV token from the inner loop, reducing per-token compute by ~8x.

Tags
Dependencies
Instances (1 reproduction)