Fusing TBQ3 dequantization (the inverse SRHT) directly into a FlashAttention-style online-softmax kernel eliminates all intermediate buffers (k_tmp, v_tmp, S) while producing results identical to the unfused path.
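The idea can be sketched on the CPU in plain Python. This is a minimal illustration under stated assumptions, not the actual kernel: `dequant_row` is a hypothetical stand-in for TBQ3 dequantization (a uniform rescale followed by an unsubsampled inverse SRHT, i.e. a normalized Walsh-Hadamard transform and a sign flip), and the function names, the shared `scale`, and the single-query layout are all invented for the example. The unfused version materializes `k_tmp`, `v_tmp`, and `S`; the fused version dequantizes one K/V row at a time inside the online-softmax loop and keeps only a running max, a running sum, and an accumulator.

```python
import math
import random

def fwht(x):
    """Normalized fast Walsh-Hadamard transform (self-inverse); len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]

def dequant_row(codes, scale, signs):
    """Hypothetical stand-in for TBQ3 dequant: rescale, then undo y = H(D x) via x = D(H y)."""
    y = [c * scale for c in codes]
    return [s * v for s, v in zip(signs, fwht(y))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_unfused(q, k_codes, v_codes, scale, signs):
    """Reference path: materializes k_tmp, v_tmp and the full score vector S."""
    k_tmp = [dequant_row(r, scale, signs) for r in k_codes]
    v_tmp = [dequant_row(r, scale, signs) for r in v_codes]
    inv = 1.0 / math.sqrt(len(q))
    S = [inv * dot(q, k) for k in k_tmp]
    m = max(S)
    p = [math.exp(s - m) for s in S]
    l = sum(p)
    return [sum(p[t] * v_tmp[t][i] for t in range(len(p))) / l for i in range(len(q))]

def attention_fused(q, k_codes, v_codes, scale, signs):
    """Fused path: dequantize each row on the fly and fold it into an online softmax."""
    inv = 1.0 / math.sqrt(len(q))
    m, l = -math.inf, 0.0          # running max and running softmax denominator
    acc = [0.0] * len(q)           # running, unnormalized output
    for kc, vc in zip(k_codes, v_codes):
        s = inv * dot(q, dequant_row(kc, scale, signs))
        m_new = max(m, s)
        alpha = math.exp(m - m_new)  # rescales earlier contributions; 0.0 on the first row
        p = math.exp(s - m_new)
        v = dequant_row(vc, scale, signs)
        l = l * alpha + p
        acc = [a * alpha + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

if __name__ == "__main__":
    random.seed(0)
    d, n = 8, 16                    # head dimension (power of two) and sequence length
    signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
    q = [random.uniform(-1.0, 1.0) for _ in range(d)]
    k_codes = [[random.randint(-4, 3) for _ in range(d)] for _ in range(n)]  # 3-bit codes
    v_codes = [[random.randint(-4, 3) for _ in range(d)] for _ in range(n)]
    ref = attention_unfused(q, k_codes, v_codes, 0.25, signs)
    out = attention_fused(q, k_codes, v_codes, 0.25, signs)
    print(max(abs(a - b) for a, b in zip(ref, out)))
```

In this sketch the two paths agree to within floating-point rounding, since the online softmax only reorders the same sums. A real kernel would process tiles of rows rather than single rows and keep the dequantized tile in registers or shared memory, but the bookkeeping (`m`, `l`, `acc`) is the same.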