Multi-model CUDA TBQ3 validation (5 architectures)

success
0.08
1/5
Parameters
k_type q8_0
v_type tbq3
models 5
architectures [dense
Hypothesis

An asymmetric K/V configuration (q8_0 keys, tbq3 values) will keep the prefill gap under 5% and maintain decode parity across diverse model architectures.

Tags
Instances (1 reproduction)
adaptive-chunked-prefill None

Results
qwen35_9b_pp2048 -0.7%
qwen35_9b_pp8192 -0.3%
qwen35_9b_tg128 +0.8%
gemma3_12b_pp2048 -4.3%
gemma3_12b_tg128 +7.3%
nemotron_9b_pp2048 -0.2%
nemotron_9b_tg128 +3.4%
mistral_3b_pp2048 -2.1%
qwen35_35b_moe_pp2048 +4.6%
qwen35_35b_moe_tg128 +4.2%
models_with_faster_decode 4/5
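
The reported deltas can be checked against the hypothesis with a few lines of Python. This is a minimal sketch, not the experiment's own tooling: the numbers are copied from the results above, while the helper names and the sign convention (a negative delta means the q8_0/tbq3 run is slower than the baseline) are assumptions made for illustration. Keys containing `_pp` are treated as prefill tests and keys containing `_tg` as decode tests.

```python
# Percent deltas copied from the results above.
# Assumption: negative = q8_0/tbq3 slower than the baseline, positive = faster.
RESULTS = {
    "qwen35_9b_pp2048": -0.7,
    "qwen35_9b_pp8192": -0.3,
    "qwen35_9b_tg128": 0.8,
    "gemma3_12b_pp2048": -4.3,
    "gemma3_12b_tg128": 7.3,
    "nemotron_9b_pp2048": -0.2,
    "nemotron_9b_tg128": 3.4,
    "mistral_3b_pp2048": -2.1,
    "qwen35_35b_moe_pp2048": 4.6,
    "qwen35_35b_moe_tg128": 4.2,
}

PREFILL_GAP_LIMIT = 5.0  # percent, from the hypothesis ("<5% prefill gap")


def split_by_phase(results):
    """Split results into prefill (_pp) and decode (_tg) entries (hypothetical helper)."""
    prefill = {k: v for k, v in results.items() if "_pp" in k}
    decode = {k: v for k, v in results.items() if "_tg" in k}
    return prefill, decode


def worst_prefill_gap(prefill):
    """Largest prefill slowdown as a positive percentage (0.0 if nothing is slower)."""
    return max(0.0, -min(prefill.values()))


def faster_decode_count(decode):
    """Number of decode entries with a positive delta."""
    return sum(1 for v in decode.values() if v > 0)


prefill, decode = split_by_phase(RESULTS)
gap = worst_prefill_gap(prefill)
print(f"worst prefill gap: {gap}% (limit {PREFILL_GAP_LIMIT}%)")
print(f"prefill within limit: {gap < PREFILL_GAP_LIMIT}")
print(f"decode entries faster: {faster_decode_count(decode)} of {len(decode)} reported")
```

Under these assumptions, the largest prefill slowdown is the 4.3% on gemma3_12b_pp2048, which stays inside the 5% limit, and all four reported tg128 entries are positive. No tg128 figure is listed for mistral_3b, which is consistent with the 4/5 count.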