Skipping entire KV tiles when all QK scores are far below running max reduces V dequant work at long context
TensorRT-LLM XQA kernel
Correctly implemented tile-level skipping (threshold=20, so a skipped position's softmax weight is at most exp(-20)≈2e-9), but it gave 0% improvement because attention is <1% of decode time on Qwen3.5-27B (4 KV heads). Even at 65K context, attention is only 0.3% of decode; weight GEMMs dominate at >99%. The optimization itself is valid and would show gains on models with many KV heads, on MoE models (cheaper FFN, hence a larger attention fraction), or at 100K+ context. Anyone benchmarking attention-kernel optimizations on dense models at ≤65K context will see zero improvement regardless of the quality of the optimization.
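A minimal NumPy sketch of the idea, assuming a single decode-step query, a 64-row tile, and the threshold of 20 mentioned above; the function and variable names are illustrative and this is not the XQA kernel's actual code path (V is plain float here, so the dequantization that the real kernel would skip is only alluded to in comments). It runs the usual online-softmax loop over KV tiles, but drops a tile entirely when its best score sits more than `threshold` below the running max, which is exactly the point where the real kernel would avoid loading and dequantizing that tile's V rows.

```python
import numpy as np

def decode_attention_with_tile_skip(q, K, V, tile_size=64, threshold=20.0):
    """Single-query (decode-step) attention over a tiled KV cache.

    A tile whose best QK score is more than `threshold` below the running
    max is skipped outright: its softmax weights are bounded by
    exp(-threshold), so its V rows are never read (in the real kernel,
    this is where V dequantization would be avoided).
    """
    scale = 1.0 / np.sqrt(q.shape[0])
    running_max = -np.inf
    denom = 0.0
    acc = np.zeros(V.shape[1])
    skipped = 0

    for start in range(0, K.shape[0], tile_size):
        scores = (K[start:start + tile_size] @ q) * scale   # QK^T for one tile
        tile_max = scores.max()

        # Skip test: every score in this tile is at least `threshold` below
        # the running max. The running max only grows afterwards, so each
        # skipped position's final softmax weight is <= exp(-threshold).
        if tile_max < running_max - threshold:
            skipped += 1
            continue

        # Standard online-softmax update: rescale previous partial sums,
        # then accumulate this tile's probabilities against its V rows.
        new_max = max(running_max, tile_max)
        correction = np.exp(running_max - new_max)           # 0.0 on first tile
        probs = np.exp(scores - new_max)
        denom = denom * correction + probs.sum()
        acc = acc * correction + probs @ V[start:start + tile_size]
        running_max = new_max

    return acc / denom, skipped


# Illustrative check: a query strongly aligned with one early key pushes the
# running max high enough that every later tile is skipped, yet the output
# still matches a plain softmax reference to within the exp(-20) bound.
rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
q = 25.0 * K[0]
out, n_skipped = decode_attention_with_tile_skip(q, K, V)
s = (K @ q) / np.sqrt(64)
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V
assert np.allclose(out, ref, atol=1e-6)
print(f"skipped {n_skipped} of {K.shape[0] // 64} tiles")
```

Doing the skip test before the running-max update keeps the correctness argument simple: the running max only increases, so every skipped position's final softmax weight stays below exp(-threshold) ≈ 2e-9, and the omitted mass is negligible next to a denominator that is at least 1.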