D=256 support for fused TBQ3 dequant-FlashAttention kernel

success
0.08
1/5
Hypothesis

Adding head_dim=256 support to the fused TBQ3 dequant-FlashAttention kernel enables it to serve Qwen3.5-27B and Gemma-3 models. This requires two changes: the per-block SRHT becomes two independent 128-element butterflies for D=256, and FLOATS_PER_LANE is generalized to D/WARP_SIZE. A sketch of both changes follows.
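As a minimal illustration of the two changes the hypothesis names, the CUDA sketch below shows a warp-cooperative 128-element Hadamard butterfly and a head-dimension-templated SRHT wrapper. The identifiers (`warp_wht_128`, `apply_srht`, `sign_bits`), the register layout, and the assumption that each lane holds FLOATS_PER_LANE = D/WARP_SIZE values of the head in registers are illustrative assumptions, not the TBQ3 kernel's actual code.

```cuda
// Sketch only: names and layout are assumptions, not the kernel's real identifiers.
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Warp-cooperative 128-point Walsh-Hadamard butterfly.
// Assumed layout: element index = r * WARP_SIZE + lane, r in [0, 4),
// so each lane keeps 4 of the 128 values in registers.
__device__ inline void warp_wht_128(float* v) {
    const int lane = threadIdx.x & (WARP_SIZE - 1);
    // Stages over the five index bits held in the lane id (cross-lane via shuffle).
    #pragma unroll
    for (int s = 1; s < WARP_SIZE; s <<= 1) {
        #pragma unroll
        for (int r = 0; r < 4; ++r) {
            float other = __shfl_xor_sync(0xffffffffu, v[r], s);
            v[r] = (lane & s) ? (other - v[r]) : (v[r] + other);
        }
    }
    // Stages over the two index bits held in the register index r (in-lane).
    #pragma unroll
    for (int s = 1; s < 4; s <<= 1) {
        #pragma unroll
        for (int r = 0; r < 4; ++r) {
            if ((r & s) == 0) {
                float a = v[r], b = v[r ^ s];
                v[r]     = a + b;
                v[r ^ s] = a - b;
            }
        }
    }
}

// Per-block SRHT generalized over the head dimension.
// FLOATS_PER_LANE = D / WARP_SIZE: 4 registers per lane for D=128, 8 for D=256.
// For D=256 the transform is two independent 128-element butterflies, one per half.
// sign_bits is a hypothetical per-lane mask packing the block's random +/-1 signs
// (bit r flips element r * WARP_SIZE + lane).
template <int D>
__device__ void apply_srht(float (&v)[D / WARP_SIZE], unsigned sign_bits) {
    constexpr int FLOATS_PER_LANE = D / WARP_SIZE;
    static_assert(FLOATS_PER_LANE == 4 || FLOATS_PER_LANE == 8, "D must be 128 or 256");

    #pragma unroll
    for (int r = 0; r < FLOATS_PER_LANE; ++r)
        if ((sign_bits >> r) & 1u) v[r] = -v[r];    // random sign diagonal

    warp_wht_128(&v[0]);                            // butterfly over elements 0..127
    if constexpr (FLOATS_PER_LANE == 8)
        warp_wht_128(&v[4]);                        // second butterfly, elements 128..255
    // 1/sqrt(128) normalization omitted; it can be folded into existing scales.
}
```

Because the Walsh-Hadamard stages act on disjoint index bits, the cross-lane and in-register stages of this sketch can run in either order, and D=256 needs no new shuffle pattern, only a second 128-element butterfly over the upper half of the head dimension.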

Tags
Dependencies
Instances (1 reproduction)