Capturing the decode step as a CUDA Graph eliminates per-kernel launch overhead (hundreds of kernel launches per token)
NVIDIA upstream llama.cpp contribution
SUCCESS. +3.1% decode throughput from eliminating kernel launch overhead. NVIDIA contributed basic CUDA Graphs support to upstream llama.cpp (10-15% reported on their hardware). Our gain is lower because RTX 3090 decode is more memory-bound, so CPU launch overhead is a smaller fraction of each step. The graph captures the full decode step; each token replays the same graph with updated pointers. Main limitation: llama.cpp rebuilds the graph per token to handle variable shapes, limiting reuse. SGLang achieves higher gains (10-15%) with piecewise per-layer graphs and multi-stream capture.
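A minimal sketch of the capture/update/replay pattern described above. The kernel, layer count, and buffer are hypothetical stand-ins for llama.cpp's per-layer decode kernels; the structure (re-capture per token, patch the instantiated graph in place via cudaGraphExecUpdate, re-instantiate only when topology changes, then replay with a single launch) mirrors the approach described here, not the upstream code verbatim.

```cuda
// capture_decode_graph.cu -- sketch, not llama.cpp's actual implementation.
#include <cuda_runtime.h>

__global__ void layer_kernel(float *x, int n) {        // stand-in for one of the
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // hundreds of decode kernels
    if (i < n) x[i] = x[i] * 0.5f + 1.0f;
}

// Stand-in for the full decode step: many small kernel launches per token.
static void run_decode_step(cudaStream_t s, float *buf, int n) {
    for (int l = 0; l < 32; ++l)                       // e.g. one kernel per layer
        layer_kernel<<<(n + 255) / 256, 256, 0, s>>>(buf, n);
}

int main() {
    const int n = 4096;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraphExec_t exec = nullptr;
    for (int token = 0; token < 128; ++token) {
        // Re-capture each token so changed pointers/shapes are picked up;
        // capture records launches without executing them.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_decode_step(stream, buf, n);
        cudaStreamEndCapture(stream, &graph);

        if (exec == nullptr) {
            cudaGraphInstantiate(&exec, graph, 0);     // CUDA 12 signature
        } else {
            // Cheap path: patch kernel parameters in the existing executable
            // graph; fall back to re-instantiation if the topology changed.
            cudaGraphExecUpdateResultInfo info;
            if (cudaGraphExecUpdate(exec, graph, &info) != cudaSuccess) {
                cudaGraphExecDestroy(exec);
                cudaGraphInstantiate(&exec, graph, 0);
            }
        }
        cudaGraphDestroy(graph);

        // One launch replays all captured kernels, replacing hundreds of
        // individual launch calls per token.
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(exec);
    cudaStreamDestroy(stream);
    cudaFree(buf);
    return 0;
}
```

The per-token re-capture in this sketch is the limitation noted above: capture itself costs CPU time, so the net win depends on how much launch overhead the replay removes relative to GPU work per step.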