dan-and Madreag CUDA fork (4x RTX 3080)

Consensus Metrics
All values are single runs (n=1, σ=0).

Decode tg128 (tok/s) by KV depth:

  depth   q8_0    turbo3   ratio
  0K      62.00   48.93    0.79
  4K      58.47   36.40    0.62
  8K      55.67   28.35    0.51
  16K     49.39   19.75    0.40
  204K    28.13    5.38    0.19

Prefill ratio (turbo3 / q8_0): 0K 0.96, 16K 1.00, 204K 1.02
KV cache memory: q8_0 2948 MiB, turbo3 1361 MiB (2.17x compression)
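The decode_ratio values are simply turbo3 over q8_0 throughput at each depth; a quick sanity check, with the raw tok/s figures copied from the metrics above:

```python
# Decode throughput in tok/s at each KV depth, copied from the metrics above.
q8 = {"0k": 62.0, "4k": 58.47, "8k": 55.67, "16k": 49.39, "204k": 28.13}
turbo3 = {"0k": 48.93, "4k": 36.4, "8k": 28.35, "16k": 19.75, "204k": 5.38}

# Recompute the reported decode_ratio_* values (rounded to 2 decimals).
ratios = {ctx: round(turbo3[ctx] / q8[ctx], 2) for ctx in q8}
print(ratios)
```

The recomputed ratios match the reported ones at every depth, so the ratio columns are internally consistent with the raw throughputs.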
Parameters
framework llama.cpp
gpus 4
gpu_type RTX_3080
implementation madreag_turbo3_cuda
context 260000
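The tg128-at-depth metrics match the shape of llama.cpp's llama-bench output. A run along these lines would produce them; the flag spellings, model path, and the exact 204K depth value are my assumptions, so verify against `llama-bench --help` before reusing:

```shell
# Hypothetical llama-bench invocation (not from the source).
# -d prefills the KV cache to each listed depth before timing a 128-token
# generation (tg128); -p 0 skips the standalone prompt-processing test;
# -ngl 99 offloads all layers across the 4 GPUs.
# 204K is written as 204800 here; the exact depth used is not stated.
./llama-bench -m Qwen3.5-35B-A3B-Q8_K_XL.gguf \
  -ngl 99 -p 0 -n 128 \
  -d 0,4096,8192,16384,204800
```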
Hypothesis

Madreag CUDA fork maintains decode performance at long context

Reference

https://github.com/TheTom/llama-cpp-turboquant/issues/3

Tags
Subject
Model: Qwen3.5-35B-A3B-Q8_K_XL
Dataset: llama-benchy
Baseline Comparison
decode_ratio_204k: 0.19x vs q8_0 — unusable at long context
Instances (1 reproduction)
apple-silicon-baselines dan-and 4x RTX 3080

Decode throughput falls off a cliff at long context (0.19x at 204K). This is the same dequantization bottleneck seen on Metal, but far worse here: the Madreag turbo3 CUDA kernel is unoptimized. Prefill is unaffected (~1.0x at all depths). KV compression reaches only 2.17x because 30 of Qwen3.5-35B-A3B's 40 layers use linear attention, whose state does not compress.
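A back-of-envelope decomposition of the KV numbers is consistent with that explanation. If we assume (my assumption, not measured) that turbo3 shrinks full-attention KV by roughly 8/3 (~3-bit vs 8-bit storage) and leaves the linear-attention recurrent state untouched, the measured totals imply the cache splits into a large compressible part plus a ~400 MiB incompressible remainder, capping the overall ratio at 2.17x:

```python
# Split the measured KV totals into a compressible part F (full-attention KV)
# and an incompressible part S (linear-attention state).
# ASSUMPTION (illustrative, not from the benchmark): turbo3 shrinks F by
# c = 8/3 and stores S unchanged.
q8_total, turbo3_total = 2948.0, 1361.0  # MiB, from the metrics
c = 8 / 3

# q8_total = F + S and turbo3_total = F/c + S
#   =>  F = (q8_total - turbo3_total) / (1 - 1/c)
F = (q8_total - turbo3_total) / (1 - 1 / c)
S = q8_total - F

print(round(F), round(S))                  # compressible vs incompressible MiB
print(round(q8_total / turbo3_total, 2))   # overall compression ratio
```

Under these assumed factors the incompressible linear-attention state alone keeps the overall ratio well below the per-layer 8/3 compression.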
