Q tensor addressing bug fix in fused kernel

success
0.08
1/5
Overview Experiments 96 Forks 3 Resources 36 Benchmarks 2 Broadcasts 3 Related
Hypothesis

The fused kernel's Q addressing was swapped — it used q_head * (nq * D) instead of q_idx * (nh_q * D). After ggml_permute(0,2,1,3), Q physical layout is [nq, nh_q, D] (token-major) not [nh_q, nq, D] (head-major). Fixing this should make PPL match baseline.

Tags
Dependencies
Instances (1 reproduction)