Forcing non-vectorized FA kernel (nl=2) improves single-token decode
Non-vec kernel processes batch=1 inefficiently at long context. Wrong kernel for this workload.