Because the SRHT (H, diag(r)) is fixed across all tokens, the rotation can be hoisted out of the per-token loop: pre-rotate Q once with the forward SRHT, compute K scores directly against the centroids (no K butterfly), accumulate V in the compressed domain (no V butterfly), and apply a single inverse SRHT at the end. This removes 14 butterfly stages per KV token from the inner loop, cutting per-token compute by ~8x.
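
The identity behind this can be checked numerically. The following is a minimal NumPy sketch, not the actual kernel. It assumes a full (non-subsampled) orthonormal Hadamard rotation, a head dimension of d = 128 (which would give log2(128) = 7 butterfly stages per transform, consistent with 14 stages for K plus V, although d is not stated above), and a crude rounding quantizer standing in for the real centroid codebook. The names `fwht`, `forward`, `inverse`, `attend_naive`, and `attend_fused` are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis.
    The last-axis length must be a power of two."""
    x = np.array(x, dtype=np.float64)
    d = x.shape[-1]
    lead = x.shape[:-1]
    h = 1
    while h < d:
        # One butterfly stage: pair up adjacent blocks of width h.
        y = x.reshape(*lead, d // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*lead, d)
        h *= 2
    return x / np.sqrt(d)

def forward(x, r):
    """Forward rotation: H @ diag(r) @ x."""
    return fwht(x * r)

def inverse(y, r):
    """Inverse rotation: diag(r) @ H @ y (orthonormal H is its own inverse)."""
    return fwht(y) * r

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend_naive(q, K_rot, V_rot, r):
    """Baseline: un-rotate every cached K and V row, then attend."""
    K = inverse(K_rot, r)  # one inverse transform per K token
    V = inverse(V_rot, r)  # one inverse transform per V token
    p = softmax(K @ q / np.sqrt(q.shape[-1]))
    return p @ V

def attend_fused(q, K_rot, V_rot, r):
    """Fused: rotate Q once, stay in the rotated domain, un-rotate once."""
    q_rot = forward(q, r)                          # single forward transform
    p = softmax(K_rot @ q_rot / np.sqrt(q.shape[-1]))
    acc = p @ V_rot                                # accumulate V in rotated domain
    return inverse(acc, r)                         # single inverse transform

rng = np.random.default_rng(0)
d, n = 128, 16                       # assumed head dim and number of KV tokens
r = rng.choice([-1.0, 1.0], size=d)  # fixed random sign vector
q = rng.standard_normal(d)

# Stand-in for the compressed cache: rotate, then snap to a coarse grid.
# The real centroid codebook is not specified here; this is an assumption.
K_rot = np.round(forward(rng.standard_normal((n, d)), r) * 4) / 4
V_rot = np.round(forward(rng.standard_normal((n, d)), r) * 4) / 4

out_naive = attend_naive(q, K_rot, V_rot, r)
out_fused = attend_fused(q, K_rot, V_rot, r)
print("max abs difference:", np.abs(out_naive - out_fused).max())
```

The two paths agree up to floating-point error because the rotation is orthogonal (so dot products are preserved when both operands are rotated) and linear (so it commutes with the weighted sum over V). The fused path runs two transforms per query regardless of how many KV tokens are cached.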