CUDA implementation on Blackwell achieves near-parity with F16 decode
First-time contributor, built with Claude Code. 98.5% F16 decode on Blackwell. Converted to draft due to quality issues at larger context windows. Same dequant bottleneck pattern as Metal at long context.