Multi-model CUDA TBQ3 validation (5 architectures)

success

0.08

1/5

Parameters

k_type q8_0

v_type tbq3

models 5

architectures [dense

Hypothesis

An asymmetric configuration (q8_0 for k_type, tbq3 for v_type) will keep the prefill gap below 5% and maintain decode parity across diverse model architectures.

Tags

Instances (1 reproduction)

adaptive-chunked-prefill None

Results

qwen35_9b_pp2048 -0.7%
qwen35_9b_pp8192 -0.3%
qwen35_9b_tg128 +0.8%
gemma3_12b_pp2048 -4.3%
gemma3_12b_tg128 +7.3%
nemotron_9b_pp2048 -0.2%
nemotron_9b_tg128 +3.4%
mistral_3b_pp2048 -2.1%
qwen35_35b_moe_pp2048 +4.6%
qwen35_35b_moe_tg128 +4.2%
models_with_faster_decode 4/5
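The reported deltas can be checked against the hypothesis with a short script. This is a minimal sketch, not part of the experiment itself. It assumes that `pp` keys are prefill measurements, that `tg` keys are decode measurements, and that a positive percentage means the q8_0/tbq3 run was faster than the baseline. The variable names are illustrative only.

```python
# Reported deltas, copied from the results above.
results = {
    "qwen35_9b_pp2048": "-0.7%",
    "qwen35_9b_pp8192": "-0.3%",
    "qwen35_9b_tg128": "+0.8%",
    "gemma3_12b_pp2048": "-4.3%",
    "gemma3_12b_tg128": "+7.3%",
    "nemotron_9b_pp2048": "-0.2%",
    "nemotron_9b_tg128": "+3.4%",
    "mistral_3b_pp2048": "-2.1%",
    "qwen35_35b_moe_pp2048": "+4.6%",
    "qwen35_35b_moe_tg128": "+4.2%",
}


def parse_pct(text: str) -> float:
    """Turn a string such as '+4.6%' into the float 4.6."""
    return float(text.strip().rstrip("%"))


# Assumption: the last underscore-separated token names the test
# (pp = prefill, tg = decode) and everything before it names the model.
prefill = {k: parse_pct(v) for k, v in results.items() if k.rsplit("_", 1)[1].startswith("pp")}
decode = {k: parse_pct(v) for k, v in results.items() if k.rsplit("_", 1)[1].startswith("tg")}
models = {k.rsplit("_", 1)[0] for k in results}

# Assumption: the "prefill gap" is the worst slowdown, i.e. the most negative delta.
worst_prefill = min(prefill.values())
prefill_ok = worst_prefill > -5.0

# Count models whose decode delta is positive (faster under the assumed sign convention).
faster_decode = sum(1 for v in decode.values() if v > 0)

print(f"worst prefill delta: {worst_prefill}% (within 5%: {prefill_ok})")
print(f"models with faster decode: {faster_decode}/{len(models)}")
```

Under these assumptions the count reproduces the reported 4/5. The fifth model, mistral_3b, has no tg128 entry in the listed results, so it is not counted as faster.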