Fused weight quantize + GEMM (speed)

Result: negative
Consensus Metrics
decode_baseline 30 (n=1, σ=0)
decode_f32_fma 24.82 (n=1, σ=0)
decode_inreg_dp4a 20.57 (n=1, σ=0)
decode_smem_dp4a 8.7 (n=1, σ=0)
Parameters
variants [f32_fma, inreg_dp4a, smem_dp4a]
Hypothesis

Fusing activation quantization into the weight GEMM kernel should eliminate the separate q8_1 quantize kernel launch and its L2 traffic, improving decode throughput.
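To make the hypothesis concrete, here is a minimal host-side reference sketch of the per-block q8_1-style quantization the separate kernel performs. This assumes the llama.cpp convention of 32-element blocks with a per-block scale d = amax/127; it is an illustration, not the actual CUDA kernel.

```python
import numpy as np

def quantize_q8_1(x, block=32):
    """Reference sketch: symmetric per-block int8 quantization.

    Assumes llama.cpp-style q8_1 layout: 32-element blocks,
    scale d = amax / 127 per block. Returns (int8 values, scales).
    """
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    d = amax / 127.0
    # Guard against all-zero blocks (d == 0) to avoid division by zero.
    q = np.rint(np.divide(x, d, out=np.zeros_like(x), where=d > 0))
    return q.astype(np.int8), d.squeeze(1)

x = np.random.randn(128).astype(np.float32)
q, d = quantize_q8_1(x)
# Dequantization error is bounded by half a quantization step per block.
err = np.abs(q.astype(np.float32) * d[:, None] - x.reshape(-1, 32)).max()
```

The point of the experiment is where this work runs: once, in a standalone kernel whose output every GEMM block reads through L2, versus redundantly inside each GEMM block.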

Tags
Subject
Model: Qwen3.5-27B-Q6_K
Baseline Comparison
f32_fma: -17.4%
inreg_dp4a: -31.5%
smem_dp4a: -71.0%
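The percentages follow directly from the consensus decode rates above. The recomputation below assumes those rates are throughputs (higher is better); the reported deltas were likely computed against an unrounded baseline, so the recomputed values can differ by about 0.1 percentage points.

```python
# Recompute the baseline-relative deltas from the logged decode rates.
baseline = 30.0
variants = {"f32_fma": 24.82, "inreg_dp4a": 20.57, "smem_dp4a": 8.7}

# Delta (%) = 100 * (variant / baseline - 1); negative means slower.
deltas = {name: 100.0 * (rate / baseline - 1.0)
          for name, rate in variants.items()}
```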
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

All three variants were slower.

(1) f32 FMA replacing DP4A: -17.4%. Becomes compute-bound without the integer dot product.
(2) In-register quantize + DP4A: -31.5%. SFU overhead plus redundant per-block quantization (rpb=1: each block re-quantizes the same activation row).
(3) Shared-memory cooperative quantize: -71.0%. Phase-1 quantize overhead (940 ns/block) exceeds the phase-2 MMA work (260 ns/block).

The separate quantize_q8_1 kernel is already optimal: it quantizes once and distributes the result via L2 cache. Per-block redundant quantization always loses to quantize-once-and-distribute. The 2.3% kernel-launch overhead between quantize and GEMM is irreducible without CUDA Graphs.

Lesson: fusing two well-optimized kernels often loses when the fusion introduces redundant work.
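A back-of-envelope cost model using the per-block timings from the findings shows why smem_dp4a loses so badly. This is a simplification: it ignores the standalone quantize kernel's own (amortized) cost and any memory-bound overlap, so it over-predicts the observed slowdown somewhat.

```python
# Per-block timings from the smem_dp4a finding.
quantize_ns = 940.0  # phase-1 cooperative quantize, per GEMM block
mma_ns = 260.0       # phase-2 DP4A/MMA work, per GEMM block

# Fused: every block pays quantize + MMA. Unfused: blocks pay only MMA,
# since the quantize-once kernel's cost is amortized across all blocks.
per_block_slowdown = (quantize_ns + mma_ns) / mma_ns  # ~4.6x

# Observed end-to-end slowdown from the decode rates (30.0 -> 8.7).
observed_slowdown = 30.0 / 8.7  # ~3.4x
```

The model's ~4.6x per-block slowdown brackets the observed ~3.4x end-to-end slowdown, consistent with the conclusion that redundant per-block quantization, not the fusion itself, is the dominant cost.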
