signalnine CUDA PR #3 (RTX 5090 Blackwell)

success

0.14

1/5


Consensus Metrics

decode_tok_s_f16 95.4 (n=1, σ=0)

decode_tok_s_q8_0 95.7 (n=1, σ=0)

decode_tok_s_turbo3 94.0 (n=1, σ=0)

decode_ratio_vs_f16 0.985 (n=1, σ=0)

compression_ratio 3.47 (n=1, σ=0)

Parameters

gpu_arch sm_12_0

implementation native_cuda

wht shared_memory_butterfly

norm recon_norm_correction
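
The `wht` parameter above is listed as `shared_memory_butterfly`. Assuming `wht` refers to a Walsh-Hadamard transform (the record does not spell this out), the following is a minimal Python sketch of the butterfly structure such a transform typically uses. It is not the PR's CUDA kernel; the function name `fwht` is hypothetical and the transform is left unnormalized for simplicity.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (illustrative sketch only).

    The input length must be a power of two. Returns a new list.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    h = 1
    while h < n:
        # One butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    y = fwht(x)
    # Applying the unnormalized transform twice scales the input by its length.
    print(y, fwht(y))
```

On a GPU, the pairs within each stage are independent, so a kernel could plausibly assign them to separate threads working on a shared-memory buffer with a barrier between stages. That mapping is an assumption about what `shared_memory_butterfly` means, not something this record confirms.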

Hypothesis

CUDA implementation on Blackwell achieves near-parity with F16 decode

Reference

https://github.com/TheTom/llama-cpp-turboquant/pull/3

Tags

blackwell community cuda pr third-party

Subject

Model: Qwen3.5-35B-A3B-Q8_0

Baseline Comparison

decode_ratio_vs_f16 98.5% of F16 speed
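
The 98.5% figure matches the turbo3 decode throughput divided by the F16 decode throughput. A small sketch of that arithmetic, using only the numbers reported on this page (the dictionary and the helper name `ratio_vs_f16` are hypothetical, chosen for illustration):

```python
# Decode throughput in tokens/s, copied from the metrics reported on this page.
decode_tok_s = {"f16": 95.4, "q8_0": 95.7, "turbo3": 94.0}


def ratio_vs_f16(name: str) -> float:
    """Return decode throughput of `name` relative to the F16 baseline."""
    return decode_tok_s[name] / decode_tok_s["f16"]


if __name__ == "__main__":
    for name in decode_tok_s:
        print(f"{name}: {ratio_vs_f16(name):.3f}")
```

With a single run per configuration (n=1), differences of this size may be within run-to-run noise.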

Instances (1 reproduction)

apple-silicon-baselines signalnine RTX 5090

First-time contributor; the PR was built with Claude Code. turbo3 decode reaches 98.5% of F16 decode speed on Blackwell. The PR was converted to draft due to quality issues at larger context windows. It shows the same dequant bottleneck pattern as Metal at long context.

decode_tok_s_f16 95.4

decode_tok_s_q8_0 95.7

decode_tok_s_turbo3 94.0

decode_ratio_vs_f16 0.985

compression_ratio 3.47