Fully branchless FMA chain with zero memory access does not beat constant LUT
XOR mask via 3 - 3*sign_bit, sign via 2*s - 1, magnitude via three chained fma() calls. Zero branches, zero memory accesses. Still slower than the constant LUT: 7 ALU cycles cost more than 1 divergent constant read on Apple8.
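
A minimal CPU-side sketch of the decode described above, to make the three arithmetic steps concrete. Several things here are assumptions rather than details from the note: the 3-bit code layout (top bit as sign, low two bits as index), the cubic coefficients, and the names decode_fma, decode_lut, and check_all_codes are all hypothetical. std::fma stands in for the shader fma().

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical cubic coefficients (not from the note); chosen so that every
// intermediate value is exactly representable in float.
constexpr float kC3 = 0.125f;
constexpr float kC2 = 0.25f;
constexpr float kC1 = 0.5f;
constexpr float kC0 = 0.25f;

// Reference table for the assumed layout: codes 0..3 negative, 4..7 positive.
constexpr float kLut[8] = {
    -7.375f, -3.25f, -1.125f, -0.25f, 0.25f, 1.125f, 3.25f, 7.375f
};

// Baseline: one table read per code.
inline float decode_lut(std::uint32_t code) { return kLut[code & 7u]; }

// Branchless variant: no table, no branches.
inline float decode_fma(std::uint32_t code) {
    std::uint32_t s    = (code >> 2) & 1u;    // assumed: top bit set = positive
    std::uint32_t mask = 3u - 3u * s;         // 3 when s == 0, 0 when s == 1
    std::uint32_t idx  = (code & 3u) ^ mask;  // mirrors the index in the negative half
    float sign = 2.0f * static_cast<float>(s) - 1.0f;  // -1 or +1
    float x    = static_cast<float>(idx);
    // Horner evaluation of the cubic with three chained fma() calls.
    float mag  = std::fma(std::fma(std::fma(kC3, x, kC2), x, kC1), x, kC0);
    return sign * mag;
}

// Checks that both decoders agree on every 3-bit code.
inline void check_all_codes() {
    for (std::uint32_t code = 0; code < 8; ++code) {
        assert(decode_fma(code) == decode_lut(code));
    }
}
```

Calling check_all_codes() only shows that the two decoders are equivalent under these assumptions. It says nothing about their relative cost on the GPU, which is what the measurement above is about.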