Deferring norm multiply to after LUT lookup improves ILP
Loses ILP: the per-element norm multiply actually hides constant-memory latency, so removing it makes things worse.
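
For reference, here is a minimal host-side C++ sketch of the two orderings being compared. It is not the actual kernel: the 16-entry table `kLut`, the 4-bit index layout, and both function names are hypothetical, and the dot-product shape of the loop is an assumption. In a CUDA kernel the table would presumably sit in `__constant__` memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-entry codebook (assumption; stands in for the constant-memory LUT).
static const std::array<float, 16> kLut = {
    -2.00f, -1.75f, -1.50f, -1.25f, -1.00f, -0.75f, -0.50f, -0.25f,
     0.00f,  0.25f,  0.50f,  0.75f,  1.00f,  1.25f,  1.50f,  1.75f};

// Baseline: every looked-up value is scaled by the norm before it is used.
// The extra multiply is independent work issued next to each LUT load.
float dot_per_element_norm(const std::vector<std::uint8_t>& idx,
                           const std::vector<float>& x, float norm) {
    assert(idx.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        const float v = kLut[idx[i] & 0x0F] * norm;  // per-element norm multiply
        acc += v * x[i];
    }
    return acc;
}

// Deferred variant: accumulate raw LUT values, apply the norm once at the end.
// Fewer multiplies, but also less arithmetic to overlap with each lookup.
float dot_deferred_norm(const std::vector<std::uint8_t>& idx,
                        const std::vector<float>& x, float norm) {
    assert(idx.size() == x.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        acc += kLut[idx[i] & 0x0F] * x[i];
    }
    return acc * norm;  // single deferred norm multiply
}
```

Both functions compute the same value up to floating-point rounding, so any difference between them on the GPU is a matter of instruction scheduling rather than arithmetic. The sketch only shows what was moved; it does not reproduce the latency effect itself.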