Different tensor roles have different per-channel variance distributions (q/k/v see different upstream activations than gate/up/down). A single global α may therefore be suboptimal; choosing α per role should let each role improve independently of the others.
EXP-0012, EXP-0007
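
A minimal sketch of how a per-role α could be wired in alongside the existing global α. The note does not say how tensors are named or how α is consumed, so everything here is an assumption made for illustration: the `*_proj` naming convention, the helper names `role_of` and `alpha_for`, and the numeric α values (placeholders, not results from EXP-0012 or EXP-0007).

```python
from typing import Dict, Optional

# Roles named in the hypothesis above.
ROLES = ("q", "k", "v", "gate", "up", "down")


def role_of(tensor_name: str) -> Optional[str]:
    """Return the role encoded in a tensor name, or None if none matches.

    Assumption: names contain a dot-separated component such as
    'q_proj' or 'gate_proj'. Adjust the parsing to the real naming scheme.
    """
    for part in tensor_name.split("."):
        stem = part[:-len("_proj")] if part.endswith("_proj") else part
        if stem in ROLES:
            return stem
    return None


def alpha_for(
    tensor_name: str,
    per_role_alpha: Dict[str, float],
    global_alpha: float,
) -> float:
    """Pick the role-specific alpha if one is configured, else the global one."""
    role = role_of(tensor_name)
    if role is None:
        return global_alpha
    return per_role_alpha.get(role, global_alpha)


if __name__ == "__main__":
    # Placeholder values only; they are not measured optima.
    per_role = {"q": 0.4, "down": 0.6}
    for name in (
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.up_proj.weight",
        "model.layers.0.mlp.down_proj.weight",
    ):
        print(name, alpha_for(name, per_role, global_alpha=0.5))
```

Falling back to the global α for unlisted roles keeps the single-α baseline reproducible: an empty `per_role_alpha` yields the global setting for every tensor, so each role can be varied in isolation while the rest stay fixed.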