Inlining dequant into the flash attention loop eliminates function call overhead,
but the expanded inline code adds I-cache pressure. Net result: worse than 4-mag.
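
For reference, a minimal sketch of the two code shapes being compared, reduced to the QK score step only (softmax and the V accumulation are omitted). Everything here is an assumption made for illustration and not the actual kernel: the `qrow_t` layout (int8 values with one float scale per row), `HEAD_DIM`, and the names `dequant_row`, `scores_call` and `scores_inline` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAD_DIM 4 /* hypothetical tiny head size, for illustration only */

/* Hypothetical quantized key row: int8 values sharing one float scale. */
typedef struct {
    float scale;
    int8_t q[HEAD_DIM];
} qrow_t;

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

/* Out-of-line dequant: one shared copy of the code, one call per key row. */
static NOINLINE void dequant_row(const qrow_t *row, float *out) {
    for (size_t d = 0; d < HEAD_DIM; ++d)
        out[d] = row->scale * (float)row->q[d];
}

/* Variant A: the loop calls the dequant helper for every key row. */
static void scores_call(const float *query, const qrow_t *keys,
                        size_t n_keys, float *scores) {
    assert(query && keys && scores);
    for (size_t j = 0; j < n_keys; ++j) {
        float k[HEAD_DIM];
        dequant_row(&keys[j], k);
        float acc = 0.0f;
        for (size_t d = 0; d < HEAD_DIM; ++d)
            acc += query[d] * k[d];
        scores[j] = acc;
    }
}

/* Variant B: dequant is written directly into the loop body. There is no
 * call and no temporary row, but the dequant code is duplicated at every
 * site that needs it, which grows the hot loop's instruction footprint. */
static void scores_inline(const float *query, const qrow_t *keys,
                          size_t n_keys, float *scores) {
    assert(query && keys && scores);
    for (size_t j = 0; j < n_keys; ++j) {
        float acc = 0.0f;
        for (size_t d = 0; d < HEAD_DIM; ++d)
            acc += query[d] * (keys[j].scale * (float)keys[j].q[d]);
        scores[j] = acc;
    }
}
```

Both variants compute the same scores. Which one is faster depends on how large the expanded loop body gets relative to the instruction cache, so it has to be measured on the target hardware.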