Processing queries in batches reduces the S buffer from O(nh_q*nq*chunk) to O(nh_q*q_batch*chunk)
The K-outer, Q-inner loop dequantizes K/V once per KV chunk. The causal skip optimization is removed: q_start is batch-local, not an absolute sequence position, so the q_start+q_len<=kv_start comparison was wrong. The mask handles causality. PPL is identical to the non-batched chunked path.
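
For illustration only, here is a minimal single-head sketch in pure Python of the loop structure described above. It is not the code from this change: the function names, the `buggy_skip` flag, and the choice to model the batch-local `q_start` as 0 are assumptions made for the example, and slicing K/V stands in for the per-chunk dequantization. It shows the S tile sized by q_batch*chunk, the causal mask applied on absolute positions, and how a skip test on a batch-local `q_start` can drop KV chunks that later query batches still need.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_causal_attention(q, k, v, q_pos):
    """Reference: query i attends to keys at absolute positions <= q_pos[i]."""
    d = len(v[0])
    out = []
    for qi, pos in zip(q, q_pos):
        scores = [dot(qi, kj) for kj in k[:pos + 1]]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(len(w))) / z
                    for c in range(d)])
    return out

def batched_chunked_attention(q, k, v, q_pos, q_batch, chunk, buggy_skip=False):
    """K-outer, Q-inner chunked attention with an online softmax.

    buggy_skip=True re-enables a (hypothetical) causal skip that compares a
    batch-local q_start against the absolute kv_start.
    """
    nq, d = len(q), len(v[0])
    run_max = [-math.inf] * nq            # running max score per query
    run_sum = [0.0] * nq                  # running softmax denominator
    acc = [[0.0] * d for _ in range(nq)]  # running weighted sum of V

    for kv_start in range(0, len(k), chunk):          # K-outer
        # Stand-in for dequantizing K/V: done once per KV chunk.
        k_c = k[kv_start:kv_start + chunk]
        v_c = v[kv_start:kv_start + chunk]

        for b0 in range(0, nq, q_batch):              # Q-inner
            q_len = min(q_batch, nq - b0)
            q_start = 0  # assumption: offset inside the batch, not absolute
            if buggy_skip and q_start + q_len <= kv_start:
                continue  # wrong: ignores the batch's absolute position

            # S tile is q_len x len(k_c), i.e. at most q_batch x chunk.
            s_tile = [[dot(q[b0 + r], kj) for kj in k_c] for r in range(q_len)]

            for r in range(q_len):
                i = b0 + r
                for j, s in enumerate(s_tile[r]):
                    if kv_start + j > q_pos[i]:
                        continue  # causal mask on absolute positions
                    new_max = max(run_max[i], s)
                    scale = math.exp(run_max[i] - new_max)
                    p = math.exp(s - new_max)
                    run_sum[i] = run_sum[i] * scale + p
                    acc[i] = [a * scale + p * x for a, x in zip(acc[i], v_c[j])]
                    run_max[i] = new_max

    return [[a / run_sum[i] for a in acc[i]] for i in range(nq)]
```

In this sketch the mask alone gives the same result as the naive reference. A correct skip would have to compare absolute query positions against kv_start.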