Multi-GPU performance matches single-GPU ratios
Community report after the multi-GPU q_rot_buf per-device fix. turbo3 prefill runs at 97.5% of q8_0, consistent with single-GPU. Decode at 35% matches the known MoE regression (attention is <5% of total compute and the turbo dequant overhead is not amortized). turbo4 is exactly 2x slower than turbo3 and needs investigation. Possible causes: (1) missing `-fa on` flag, (2) turbo4 not hitting the MMA prefill path on multi-GPU, (3) QJL dequant overhead doubled vs the turbo3 base. BUG FIXED in this build: q_rot_buf was a single static pointer, so device 1 tried to access device 0 memory. Fixed by replacing it with a per-device array indexed by cudaGetDevice(), with GGML_CUDA_MAX_DEVICES=16 slots.
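
The per-device fix described above can be illustrated with a minimal host-only sketch. This is not the actual patch. It assumes a hypothetical `current_device()` stand-in for `cudaGetDevice()`, uses host `std::vector` storage in place of device allocations, and introduces the illustrative names `MAX_DEVICES`, `RotBuf`, `g_q_rot_buf`, and `get_q_rot_buf`. The only point is the indexing pattern: one slot per device instead of one shared static pointer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the 16-slot count named in the note; redefined here so the sketch is standalone.
constexpr int MAX_DEVICES = 16;

// Hypothetical stand-in for the active CUDA device.
// Real code would query it with cudaGetDevice(&dev).
static int g_active_device = 0;
static int current_device() { return g_active_device; }

// Hypothetical per-device scratch buffer (host memory here, device memory in real code).
struct RotBuf {
    std::vector<float> data;
};

// One slot per device, replacing a single static pointer shared by every device.
static std::array<RotBuf, MAX_DEVICES> g_q_rot_buf;

// Returns the buffer that belongs to the active device, growing it on demand.
static float * get_q_rot_buf(std::size_t n_floats) {
    const int dev = current_device();
    assert(dev >= 0 && dev < MAX_DEVICES);  // guard against indexing past the slot array
    RotBuf & slot = g_q_rot_buf[dev];
    if (slot.data.size() < n_floats) {
        slot.data.resize(n_floats);
    }
    return slot.data.data();
}
```

With this layout, a call made while device 1 is active never receives a pointer that was allocated for device 0, which is the failure mode the note describes. The bounds check is an addition of this sketch and is not something the note claims about the real fix.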