Weight GEMM dominance — attention fraction analysis

Status: success

Parameters
- context: 32768
- batch_size: 1
- profiling: cuda_events
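
The profiling parameter refers to timing GPU work with CUDA events rather than host-side wall-clock timers. A minimal sketch of that approach, using PyTorch's CUDA event API as a stand-in for whatever profiler the run actually used; the timed matvec is illustrative, not the experiment's kernel:

```python
import torch

def time_op(fn, iters=100):
    """Mean GPU time of fn() in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                      # warm-up / lazy initialization
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait so elapsed_time() is valid
    return start.elapsed_time(end) / iters

# At batch_size=1 decode, the weight GEMM degenerates to a matrix-vector
# product, which is the shape of work MMVQ-class kernels implement.
W = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
x = torch.randn(8192, 1, device="cuda", dtype=torch.float16)
print(f"matvec: {time_op(lambda: W @ x):.3f} ms")
```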
Hypothesis

KV cache operations are a negligible fraction of total decode compute at batch_size=1.
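
Why this is plausible a priori: each decoded token must stream every weight byte from DRAM once, while attention reads the KV cache once. A back-of-envelope sketch; all model dimensions below are assumed for illustration (layer count, GQA heads, head_dim, and KV element size are not taken from the run), and the real ratio also depends on how much of the 32K window is actually filled:

```python
# Dims are ASSUMED for illustration; only the 27B / Q6_K figures come from
# the subject model name. Q6_K is ~6.5625 bits/weight in llama.cpp.
def traffic_per_token(params=27e9, bpw=6.5625,
                      n_layers=48, n_kv_heads=4, head_dim=128,
                      kv_bytes_per_elem=1.0625,   # q8_0-like: 8.5 bits/elem
                      ctx=32768):
    weight = params * bpw / 8                     # weight bytes per token
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem * ctx
    return weight, kv                             # K and V, whole window

w, kv = traffic_per_token()
print(f"weights: {w/1e9:.1f} GB/token, KV: {kv/1e9:.2f} GB/token "
      f"({kv/w:.1%} of weight traffic)")
```

Under these assumptions KV traffic is a single-digit share of weight traffic, so decode time should be set almost entirely by the weight matvec; the measured attention share can be smaller still if the window is not fully occupied.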

Subject
Model: Qwen3.5-27B-Q6_K
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

Findings

At batch_size=1, decode is completely dominated by the weight GEMM: MMVQ accounts for 83.6% of total time and runs at 88-94% of peak DRAM bandwidth. Flash attention is only 0.6% of decode time at 32K context. Changing the KV quantization type moves total decode speed by only 4-5%; the difference between q8_0 and turbo3 decode speed is structural memory-layout overhead, not compute. Changing the weight quantization (e.g., Q4_K_M vs Q6_K) moves speed by ~25%.

Key insight: for single-user dense-model inference, KV cache decode-speed optimizations are nearly invisible. Focus KV cache work on memory footprint (enabling longer context or larger models) and quality (KLD/PPL); speed optimizations should target the weight GEMM or prefill.
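
A quick roofline check of the MMVQ figures: if the matvec must stream all Q6_K weight bytes per token, DRAM bandwidth caps decode speed directly. A sketch, assuming the 27B parameter count from the model name, llama.cpp's Q6_K size, and the RTX 3090's rated 936 GB/s:

```python
PARAMS = 27e9          # assumed from the model name (Qwen3.5-27B)
BPW_Q6K = 6.5625       # llama.cpp Q6_K bits per weight
PEAK_BW = 936e9        # RTX 3090 rated DRAM bandwidth, bytes/s

weight_bytes = PARAMS * BPW_Q6K / 8     # ~22.1 GB streamed per token
ceiling = PEAK_BW / weight_bytes        # ~42 tok/s bandwidth roofline
for util in (0.88, 0.94):               # measured BW utilization range
    print(f"{util:.0%} of peak -> {ceiling * util:.1f} tok/s")
```

At the measured 88-94% utilization this pins decode at roughly 37-40 tok/s regardless of anything attention-side, which is consistent with KV-side changes moving the total by only a few percent.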

Metrics
- flash_attn_pct: 0.6%
- mmvq_pct: 83.6%
- mmvq_bw_utilization: 88-94%
- kv_quant_speed_impact: 4-5%
- weight_quant_speed_impact: 25%