Dual RTX 4090 multi-GPU validation

inconclusive
0.14
1/5
Consensus Metrics
prefill_turbo3 4987 t/s (n=1, σ=0)
prefill_q8 5117 t/s (n=1, σ=0)
decode_turbo3 36.27 t/s (n=1, σ=0)
decode_q8 103.57 t/s (n=1, σ=0)
prefill_turbo4 2542 t/s (n=1, σ=0)
decode_turbo4 17.63 t/s (n=1, σ=0)
Parameters
type_k turbo3
type_v turbo3
context 45000
gpus 2
Hypothesis

Multi-GPU turbo3/q8_0 performance ratios match the single-GPU ratios

Tags
Subject
Model: Qwen3.5-35B-A3B-Q8_0
Baseline Comparison
prefill_turbo3 -2.5% decode_turbo3 -65%
Instances (1 reproduction)
cuda-rtx3090 community RTX 4090 x2

Community report after the multi-GPU q_rot_buf per-device fix. turbo3 prefill reaches 97.5% of q8_0, consistent with single-GPU results. Decode at 35% of q8_0 matches the known MoE regression: attention is under 5% of total compute, so the turbo dequantization overhead is not amortized. turbo4 is almost exactly 2× slower than turbo3 in both prefill and decode and needs investigation. Possible causes: (1) missing `-fa on` flag, (2) turbo4 not hitting the MMA prefill path on multi-GPU, (3) QJL dequantization overhead doubled relative to the turbo3 base.

BUG FIXED in this build: q_rot_buf was a single static pointer, so kernels on device 1 tried to access memory allocated on device 0. Fixed with a per-device array indexed by cudaGetDevice(), sized GGML_CUDA_MAX_DEVICES = 16.
