0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE)

Status: success
Consensus Metrics
kv_savings_pct 30.9 (n=1, σ=0)
niah_score 5 (n=1, σ=0)
niah_total 5 (n=1, σ=0)
prefill_tok_s_min 8238 (n=1, σ=0)
prefill_tok_s_max 9684 (n=1, σ=0)
decode_tok_s_short 131 (n=1, σ=0)
decode_tok_s_131k 98 (n=1, σ=0)
kv_cache_mb_tq_131k 521.9 (n=1, σ=0)
kv_cache_mb_baseline_131k 755.7 (n=1, σ=0)
full_attn_layers_compression 4.4 (n=1, σ=0)
cosine_sim_3bit_keys 1.0 (n=1, σ=0)
cosine_sim_2bit_values 0.94 (n=1, σ=0)
cosine_sim_4bit_values 0.997 (n=1, σ=0)
Parameters
framework vllm
gpus 8
gpu_type RTX_3090
implementation monkey_patch_triton
kept_qjl true
key_bits 3
value_bits 2
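The cosine_sim_* metrics above measure how well keys and values survive a quantize/dequantize round trip at the configured key_bits=3 / value_bits=2. A minimal sketch of such a check, using plain symmetric per-group quantization on random data for illustration only; this is not TurboQuant's actual scheme (which also applies the QJL key transform, Algorithm 2 in the reference repo), so the printed similarities will differ from the measured ones:

    import torch

    def quant_roundtrip(x: torch.Tensor, bits: int, group: int = 64) -> torch.Tensor:
        # Symmetric per-group quantization: scale each group of `group` values
        # so the max magnitude maps to the largest signed b-bit code.
        g = x.reshape(-1, group)
        qmax = 2 ** (bits - 1) - 1                  # 3 for 3-bit, 1 for 2-bit
        scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
        q = (g / scale).round().clamp(-qmax, qmax)  # integer codes
        return (q * scale).reshape(x.shape)         # dequantized reconstruction

    def cos(a: torch.Tensor, b: torch.Tensor) -> float:
        return torch.nn.functional.cosine_similarity(
            a.flatten(), b.flatten(), dim=0).item()

    keys, values = torch.randn(4096, 128), torch.randn(4096, 128)
    print("3-bit keys:  ", cos(keys, quant_roundtrip(keys, 3)))
    print("2-bit values:", cos(values, quant_roundtrip(values, 2)))
    print("4-bit values:", cos(values, quant_roundtrip(values, 4)))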
Hypothesis

TurboQuant generalizes to the vLLM inference framework on multi-GPU CUDA.

Reference

https://github.com/0xSero/turboquant

Tags
Subject
Model: Qwen3.5-35B-A3B
Baseline Comparison
kv_savings_pct 30.9% vs baseline
decode_tok_s_131k -25% degradation at 131K
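Both comparison figures follow directly from the consensus metrics; a quick check:

    kv_tq, kv_base = 521.9, 755.7      # kv_cache_mb at 131K: TurboQuant vs baseline
    print(1 - kv_tq / kv_base)         # 0.309 -> 30.9% KV savings

    dec_131k, dec_short = 98, 131      # decode tok/s at 131K vs short context
    print(dec_131k / dec_short - 1)    # -0.252 -> ~25% decode degradation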
Instances (1 reproduction)
apple-silicon-baselines 0xSero 8x RTX 3090

KV savings reach only 30.9% because 30 of Qwen3.5-35B-A3B's 40 layers use linear attention, whose state can't be compressed (an architecture limitation); the 10 full-attention layers compress 4.4x. This implementation kept QJL (Algorithm 2), unlike ours. The reported 4/5 NIAH is actually 5/5 (the model reformats one answer; a parsing issue). Decode degrades at long context as on Metal, but less severely, since CUDA dequantization is cheaper than the Metal LUT path.
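A consistency check on that layer-wise explanation, assuming the reported cache totals include the uncompressed linear-attention state (an assumption; the split below is inferred from the totals, not measured): the 10 full-attention layers must hold about 40% of the baseline cache for their 4.4x compression to yield the observed 30.9% overall saving.

    kv_base, kv_tq = 755.7, 521.9      # MB at 131K context
    ratio = 4.4                        # measured full-attention-layer compression

    saved = kv_base - kv_tq            # 233.8 MB, all from the 10 full-attn layers
    full_attn_base = saved / (1 - 1 / ratio)   # baseline cache held by those layers
    linear_base = kv_base - full_attn_base     # incompressible linear-attention state

    print(full_attn_base)                        # ~302.6 MB (~40% of baseline)
    print(linear_base)                           # ~453.1 MB
    print(full_attn_base / ratio + linear_base)  # ~521.9 MB, matches kv_tq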
