Dual RTX 4090 multi-GPU validation — TurboQuant KV Cache Optimization

Consensus Metrics

prefill_turbo3 4987 (n=1, σ=0)

prefill_q8 5117 (n=1, σ=0)

decode_turbo3 36.27 (n=1, σ=0)

decode_q8 103.6 (n=1, σ=0)

prefill_turbo4 2542 (n=1, σ=0)

decode_turbo4 17.63 (n=1, σ=0)

Parameters

type_k turbo3

type_v turbo3

context 45000

gpus 2

Hypothesis

Multi-GPU performance matches single-GPU ratios

Tags

Subject

Model: Qwen3.5-35B-A3B-Q8_0

Baseline Comparison

prefill_turbo3 -2.5%

decode_turbo3 -65%
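The deltas above can be reproduced from the throughput figures listed under Consensus Metrics. The following is a minimal sketch of that arithmetic, assuming the turbo3 and q8 figures share a unit (the report does not state one) and that q8 is the baseline. The helper name `pct_delta` and the constants are illustrative and not taken from any codebase.

```cpp
#include <cassert>

// Hypothetical helper: relative change of `value` against `baseline`,
// in percent. A negative result means `value` is slower than the baseline.
inline double pct_delta(double value, double baseline) {
    assert(baseline != 0.0);  // a zero baseline has no meaningful ratio
    return (value - baseline) / baseline * 100.0;
}

// Figures copied from this report (unit not stated in the source).
constexpr double kPrefillTurbo3 = 4987.0;
constexpr double kPrefillQ8     = 5117.0;
constexpr double kDecodeTurbo3  = 36.27;
constexpr double kDecodeQ8      = 103.57;

// Prefill delta of turbo3 against q8 (about -2.5%).
inline double prefill_delta() { return pct_delta(kPrefillTurbo3, kPrefillQ8); }

// Decode delta of turbo3 against q8 (about -65%).
inline double decode_delta() { return pct_delta(kDecodeTurbo3, kDecodeQ8); }
```

Calling `prefill_delta()` and `decode_delta()` yields values that round to the two percentages listed in this section.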

Instances (1 reproduction)

cuda-rtx3090 community RTX 4090 x2

Community report, run after the multi-GPU q_rot_buf per-device fix. turbo3 prefill reaches 97.5% of q8_0 throughput, consistent with the single-GPU result. turbo3 decode reaches 35% of q8_0, which matches the known MoE regression: attention is under 5% of total compute, so the turbo dequant overhead is not amortized.

turbo4 is roughly 2x slower than turbo3 (1.96x on prefill, 2.06x on decode) and needs investigation. Possible causes: (1) missing `-fa on` flag, (2) turbo4 not hitting the MMA prefill path on multi-GPU, (3) QJL dequant overhead doubled relative to the turbo3 base.

Bug fixed in this build: q_rot_buf was a single static pointer, so device 1 tried to access device 0 memory. The fix replaces it with a per-device array indexed by the device ID from cudaGetDevice(), with GGML_CUDA_MAX_DEVICES=16 slots.
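The q_rot_buf fix described in this report follows a common pattern: replace one shared static pointer with one slot per device. The following is a host-only sketch of that pattern, not the actual patch. It makes no CUDA calls, so `current_device()` stands in for the ID that cudaGetDevice() would return. The `RotBuf` struct, `kMaxDevices`, and the accessor name are hypothetical; only the 16-slot count comes from the report.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Slot count named in the report (GGML_CUDA_MAX_DEVICES=16).
constexpr int kMaxDevices = 16;

// Stand-in for the CUDA runtime's current-device state. Real code would
// obtain the ID with cudaGetDevice(&id) instead of reading a global.
static int g_current_device = 0;
inline int current_device() { return g_current_device; }

// Hypothetical buffer record; in real code `ptr` would be device memory.
struct RotBuf {
    void*       ptr  = nullptr;
    std::size_t size = 0;
};

// One slot per device, replacing a single static pointer shared by all
// devices. Each slot starts empty and is filled on first use by its device.
static std::array<RotBuf, kMaxDevices> g_q_rot_buf{};

// Returns the slot that belongs to whichever device is current.
inline RotBuf& q_rot_buf_for_current_device() {
    const int id = current_device();
    assert(id >= 0 && id < kMaxDevices);  // guard against an out-of-range ID
    return g_q_rot_buf[static_cast<std::size_t>(id)];
}
```

With this layout, a buffer allocated while device 0 is current is never handed to code running on device 1, which instead sees its own empty slot and allocates separately.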

prefill_turbo3 4987

prefill_q8 5117

decode_turbo3 36.27

decode_q8 103.57

prefill_turbo4 2542

decode_turbo4 17.63