Bulk V dequant for TBQ prefill — closes 9% pp8192 gap

success
0.08
1/5
Consensus Metrics
pp8192_before 4358 (n=1, σ=0)
pp8192_after 4668 (n=1, σ=0)
Parameters
k_type tbq3_fused
v_type tbq3_bulk_fp16
pipeline cp_async_cg_for_v
Hypothesis

Bulk-dequantizing V to fp16 before the MMA launch, instead of fusing V dequantization into the tile loader, will close the pp8192 gap, because V tiles can then be loaded through the standard cp.async.cg pipeline.
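The difference between the two strategies can be illustrated with a toy model in plain Python. This is only a sketch under loud assumptions: the real TBQ3 layout is not described here, so `dequant_row` (integer codes times one per-row scale) and the helper names `pv_fused`, `pv_bulk`, and `accumulate` are hypothetical. It shows that both paths compute the same P·V product, and that only the bulk path leaves the tile load as a plain copy, which is the kind of transfer an asynchronous copy pipeline can issue without transforming the data.

```python
# Toy model only. The quantization scheme below is hypothetical and is NOT
# the TBQ3 format; it exists to show where the dequant step sits.

def dequant_row(q_row, scale):
    """Hypothetical dequant: integer code times a per-row scale."""
    return [q * scale for q in q_row]

def accumulate(out, P, v_tile, start):
    """Add the contribution of one V tile to out = P @ V."""
    for i, p_row in enumerate(P):
        for t, v_row in enumerate(v_tile):
            w = p_row[start + t]
            for c, v in enumerate(v_row):
                out[i][c] += w * v

def pv_fused(P, Vq, scales, tile):
    """P @ V with V dequantized inside the tile loader."""
    out = [[0.0] * len(Vq[0]) for _ in P]
    for start in range(0, len(Vq), tile):
        stop = min(start + tile, len(Vq))
        # Load and dequant are a single step, so the load is not a plain copy.
        v_tile = [dequant_row(Vq[j], scales[j]) for j in range(start, stop)]
        accumulate(out, P, v_tile, start)
    return out

def pv_bulk(P, Vq, scales, tile):
    """P @ V with V dequantized once, before the tiled loop starts."""
    V = [dequant_row(q, s) for q, s in zip(Vq, scales)]  # one pass up front
    out = [[0.0] * len(V[0]) for _ in P]
    for start in range(0, len(V), tile):
        stop = min(start + tile, len(V))
        v_tile = V[start:stop]  # plain copy of already-dequantized rows
        accumulate(out, P, v_tile, start)
    return out

# Small made-up inputs: 2 query rows, 4 V rows, head dimension 2.
P = [[0.5, 0.25, 0.25, 0.0], [0.0, 0.5, 0.0, 0.5]]
Vq = [[1, -2], [3, 0], [-4, 2], [1, 1]]
scales = [0.5, 0.25, 0.5, 1.0]

same = pv_fused(P, Vq, scales, tile=2) == pv_bulk(P, Vq, scales, tile=2)
print("fused and bulk agree:", same)
```

The trade-off the toy model leaves out is the cost of the extra fp16 V buffer and the up-front dequant pass; the hypothesis is that the simpler tile loads outweigh that cost at long prompt lengths.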

Instances (1 reproduction)
adaptive-chunked-prefill None

pp512_gap_vs_q8 +2.5%
pp2048_gap_vs_q8 -0.7%
pp8192_gap_vs_q8 -0.3%
tg128_gap_vs_q8 +0.8%
pp8192_before 4358
pp8192_after 4668
pp8192_improvement +7.1%
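As a quick check, the reported pp8192 improvement can be recomputed from the before and after figures. This is a minimal sketch: the helper name `relative_gain` is made up for illustration, and the throughput units are not stated in the source.

```python
# Throughput figures copied from the metrics above (units not stated in the source).
pp8192_before = 4358
pp8192_after = 4668

def relative_gain(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

gain = relative_gain(pp8192_before, pp8192_after)
print(f"pp8192 improvement: {gain:+.1f}%")
```

Both figures come from a single run (n=1), so the recomputed percentage carries no variance estimate.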