0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE)

Status: success
Consensus Metrics
kv_savings_pct 30.9 (n=1, σ=0)
niah_score 5 (n=1, σ=0)
niah_total 5 (n=1, σ=0)
prefill_tok_s_min 8238 (n=1, σ=0)
prefill_tok_s_max 9684 (n=1, σ=0)
decode_tok_s_short 131 (n=1, σ=0)
decode_tok_s_131k 98 (n=1, σ=0)
kv_cache_mb_tq_131k 521.9 (n=1, σ=0)
kv_cache_mb_baseline_131k 755.7 (n=1, σ=0)
full_attn_layers_compression 4.4 (n=1, σ=0)
cosine_sim_3bit_keys 1.0 (n=1, σ=0)
cosine_sim_2bit_values 0.94 (n=1, σ=0)
cosine_sim_4bit_values 0.997 (n=1, σ=0)
Parameters
framework vllm
gpus 8
gpu_type RTX_3090
implementation monkey_patch_triton
kept_qjl true
key_bits 3
value_bits 2
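The cosine_sim_* metrics above measure how well keys and values survive a quantize/dequantize round trip at the configured key_bits=3 / value_bits=2. A minimal sketch of such a check, using plain symmetric per-group quantization on random data for illustration only; this is not TurboQuant's actual scheme (which also applies the QJL key transform, Algorithm 2 in the reference repo), so the printed similarities will differ from the measured ones:

    import torch

    def quant_roundtrip(x: torch.Tensor, bits: int, group: int = 64) -> torch.Tensor:
        # Symmetric per-group quantization: scale each group of `group` values
        # so the max magnitude maps to the largest signed b-bit code.
        g = x.reshape(-1, group)
        qmax = 2 ** (bits - 1) - 1                  # 3 for 3-bit, 1 for 2-bit
        scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        q = (g / scale).round().clamp(-qmax, qmax)  # integer codes
        return (q * scale).reshape(x.shape)         # dequantized reconstruction

    def cos(a: torch.Tensor, b: torch.Tensor) -> float:
        return torch.nn.functional.cosine_similarity(
            a.flatten(), b.flatten(), dim=0).item()

    keys, values = torch.randn(4096, 128), torch.randn(4096, 128)
    print("3-bit keys:  ", cos(keys, quant_roundtrip(keys, 3)))
    print("2-bit values:", cos(values, quant_roundtrip(values, 2)))
    print("4-bit values:", cos(values, quant_roundtrip(values, 4)))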
Hypothesis

TurboQuant generalizes to the vLLM inference framework on multi-GPU CUDA.

Reference

https://github.com/0xSero/turboquant

Tags
Subject
Model: Qwen3.5-35B-A3B
Baseline Comparison
kv_savings_pct 30.9% vs baseline
decode_tok_s_131k -25% degradation at 131K
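Both comparison figures follow directly from the consensus metrics; a quick check:

    kv_tq, kv_base = 521.9, 755.7      # kv_cache_mb at 131K: TurboQuant vs baseline
    print(1 - kv_tq / kv_base)         # 0.309 -> 30.9% KV savings

    dec_131k, dec_short = 98, 131      # decode tok/s at 131K vs short context
    print(dec_131k / dec_short - 1)    # -0.252 -> ~25% decode degradation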
Instances (1 reproduction)
apple-silicon-baselines 0xSero 8x RTX 3090

KV savings reach only 30.9% because 30 of Qwen3.5-35B-A3B's 40 layers use linear attention, whose state can't be compressed (an architecture limitation); the 10 full-attention layers compress 4.4x. This implementation kept QJL (Algorithm 2), unlike ours. The reported 4/5 NIAH is actually 5/5 (the model reformats one answer; a parsing issue). Decode degrades at long context as on Metal, but less severely, since CUDA dequantization is cheaper than the Metal LUT path.
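A consistency check on that layer-wise explanation, assuming the reported cache totals include the uncompressed linear-attention state (an assumption; the split below is inferred from the totals, not measured): the 10 full-attention layers must hold about 40% of the baseline cache for their 4.4x compression to yield the observed 30.9% overall saving.

    kv_base, kv_tq = 755.7, 521.9      # MB at 131K context
    ratio = 4.4                        # measured full-attention-layer compression

    saved = kv_base - kv_tq            # 233.8 MB, all from the 10 full-attn layers
    full_attn_base = saved / (1 - 1 / ratio)   # baseline cache held by those layers
    linear_base = kv_base - full_attn_base     # incompressible linear-attention state

    print(full_attn_base)                        # ~302.6 MB (~40% of baseline)
    print(linear_base)                           # ~453.1 MB
    print(full_attn_base / ratio + linear_base)  # ~521.9 MB, matches kv_tq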
