Adding head_dim=256 support to the fused TBQ3 attention kernel lets it run Qwen3.5-27B and Gemma-3 models. Supporting D=256 requires a per-block SRHT (two independent 128-element butterflies) and generalizing FLOATS_PER_LANE to D/WARP_SIZE.
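A minimal sketch of what the two changes could look like, assuming each warp holds one 256-element head slice in registers and that blockDim.x is a multiple of 32. The names `fwht128_warp` and `srht_block_256`, the register layout, and the per-block sign array are assumptions for illustration, not the kernel's actual interface.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// One 128-element Walsh-Hadamard butterfly spread across a warp.
// Each lane owns (D/2)/WARP_SIZE contiguous elements of the half in v[].
template <int D>
__device__ void fwht128_warp(float* v)
{
    constexpr int PER_LANE = (D / 2) / WARP_SIZE;          // 4 for D=256
    const unsigned lane = threadIdx.x & (WARP_SIZE - 1);

    // In-register stages: strides 1 and 2 inside the lane's own elements.
    #pragma unroll
    for (int s = 1; s < PER_LANE; s <<= 1)
        #pragma unroll
        for (int j = 0; j < PER_LANE; ++j)
            if ((j & s) == 0) {
                float a = v[j], b = v[j | s];
                v[j]     = a + b;
                v[j | s] = a - b;
            }

    // Cross-lane stages: element strides 4..64 map onto lane-XOR masks 1..16.
    #pragma unroll
    for (int mask = 1; mask < WARP_SIZE; mask <<= 1)
        #pragma unroll
        for (int j = 0; j < PER_LANE; ++j) {
            float other = __shfl_xor_sync(0xffffffffu, v[j], mask);
            v[j] = (lane & mask) ? (other - v[j]) : (v[j] + other);
        }
}

// Per-block SRHT for D=256: random sign flip, two independent 128-element
// butterflies (one per half of the head dimension), then 1/sqrt(128) scaling.
// `sign` is a hypothetical per-block array of +-1 values, one per element.
template <int D>
__device__ void srht_block_256(float* regs, const float* sign)
{
    constexpr int FLOATS_PER_LANE = D / WARP_SIZE;          // 8 for D=256
    constexpr int HALF = FLOATS_PER_LANE / 2;               // 4 per 128-elem half
    const unsigned lane = threadIdx.x & (WARP_SIZE - 1);

    // Apply the random sign flip; regs[0..HALF) is the first half, the rest the second.
    #pragma unroll
    for (int j = 0; j < FLOATS_PER_LANE; ++j)
        regs[j] *= sign[(j < HALF ? 0 : D / 2) + lane * HALF + (j % HALF)];

    fwht128_warp<D>(regs);          // first 128-element butterfly
    fwht128_warp<D>(regs + HALF);   // second, independent 128-element butterfly

    constexpr float scale = 0.08838834764831845f;           // 1/sqrt(128)
    #pragma unroll
    for (int j = 0; j < FLOATS_PER_LANE; ++j)
        regs[j] *= scale;
}
```

With this layout, the D=128 path is the FLOATS_PER_LANE=4 special case with a single butterfly, so the generalization should not change behavior for existing head sizes.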