Cross-lane register transfer via simd_shuffle avoids constant memory entirely
Closest result to 4-mag (-2.6%). Branchless and memory-free, but simd_shuffle latency on Apple8 is comparable to a constant cache access, which negates the benefit.
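
A minimal sketch of the idea, written as a CPU emulation in C++ rather than the actual kernel. It assumes a 32-lane SIMD-group and a made-up table (entry i = i * i); the names emulate_simd_shuffle, make_lane_table and lookup_via_shuffle are hypothetical and only illustrate the pattern. Each lane keeps one table entry in a register, and a lookup becomes a shuffle from the lane whose number equals the index, so no constant memory is read at lookup time. In Metal Shading Language the inner step would correspond to a single simd_shuffle(my_entry, index) call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 32 lanes per SIMD-group.
constexpr std::size_t kSimdWidth = 32;

// One register value per lane, and one source-lane index per lane.
using LaneRegs = std::array<std::uint32_t, kSimdWidth>;
using LaneIdx  = std::array<std::uint16_t, kSimdWidth>;

// CPU stand-in for MSL's simd_shuffle(data, lane):
// every lane reads the register value held by the lane it names.
LaneRegs emulate_simd_shuffle(const LaneRegs& data, const LaneIdx& src_lane) {
    LaneRegs out{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        assert(src_lane[lane] < kSimdWidth);  // source lane must exist
        out[lane] = data[src_lane[lane]];
    }
    return out;
}

// Hypothetical table contents: lane i holds entry i = i * i.
// In a kernel, each lane would compute or load its own entry once.
LaneRegs make_lane_table() {
    LaneRegs table{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        table[lane] = static_cast<std::uint32_t>(lane * lane);
    }
    return table;
}

// Table lookup with no memory access at lookup time:
// result[lane] = table entry number indices[lane], fetched by shuffle.
LaneRegs lookup_via_shuffle(const LaneRegs& table_regs, const LaneIdx& indices) {
    return emulate_simd_shuffle(table_regs, indices);
}
```

The lookup is branchless and touches no memory, but every lookup still costs one cross-lane shuffle, which is the latency the note above compares against a constant cache access.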