The fused kernel's Q addressing had its strides swapped: it indexed with q_head * (nq * D) where it should have used q_idx * (nh_q * D). After ggml_permute(0,2,1,3), Q's physical layout is [nq, nh_q, D] (token-major), not [nh_q, nq, D] (head-major). Fixing this should make PPL match the baseline.
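A minimal sketch of the two addressing schemes, to make the difference concrete. The helper names are hypothetical and not taken from the kernel. The inner terms (q_head * D + d and q_idx * D + d) are assumptions based on a contiguous row-major layout, since only the leading term is given above.

```c
#include <stddef.h>

/* Hypothetical helper: element offset into Q for the token-major
 * layout [nq, nh_q, D]. The token index has the largest stride.
 * Assumes contiguous row-major storage. */
static size_t q_offset_token_major(size_t q_idx, size_t q_head, size_t d,
                                   size_t nh_q, size_t D) {
    return q_idx * (nh_q * D) + q_head * D + d;
}

/* Hypothetical helper: element offset for the head-major layout
 * [nh_q, nq, D]. This is the swapped form, where the head index has
 * the largest stride. */
static size_t q_offset_head_major(size_t q_idx, size_t q_head, size_t d,
                                  size_t nq, size_t D) {
    return q_head * (nq * D) + q_idx * D + d;
}
```

With nq = 2, nh_q = 4, D = 8, the element (q_idx = 1, q_head = 0, d = 0) sits at offset 32 under token-major addressing but at offset 8 under head-major addressing, so the swapped form reads a different token and head than intended whenever nq and nh_q differ.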