KV cache operations are a negligible fraction of total decode compute at batch_size=1
At batch_size=1, decode is dominated by the weight GEMM: MMVQ accounts for 83.6% of total time and runs at 88-94% of peak DRAM bandwidth. Flash attention is only 0.6% of decode time at 32K context. Changing the KV quantization type shifts total decode speed by only 4-5%; the difference between q8_0 and turbo3 decode speed is structural memory-layout overhead, not compute. Changing the weight quantization (e.g., Q4_K_M vs Q6_K) shifts speed by ~25%. KEY INSIGHT: for single-user dense-model inference, KV cache decode-speed optimizations are nearly invisible. Optimize the KV cache for memory footprint (enabling longer context or larger models) and quality (KLD/PPL); speed optimizations should target the weight GEMM or prefill.
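The "nearly invisible" claim follows directly from Amdahl's law applied to the measured time fractions. A minimal sketch, using the 83.6% MMVQ and 0.6% flash-attention shares from the profile above (the 2x and 1.25x component speedups are hypothetical illustrations, not measurements):

```python
# Amdahl's-law sketch: why KV cache speedups barely move total decode time.
# Fractions are taken from the batch_size=1 profile at 32K context in the
# note above; the component speedup values are illustrative assumptions.

def total_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when only `fraction` of runtime is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

MMVQ_FRACTION = 0.836   # weight GEMM share of decode time (measured)
ATTN_FRACTION = 0.006   # flash attention share at 32K context (measured)

# Even a hypothetical 2x faster attention kernel (e.g., from a cheaper
# KV layout) moves total decode time by well under 1%:
attn_gain = total_speedup(ATTN_FRACTION, 2.0)
print(f"2x faster attention -> {100 * (attn_gain - 1):.2f}% faster decode")

# A hypothetical 25% faster weight GEMM (e.g., a smaller weight quant
# on a bandwidth-bound kernel) dominates the end-to-end result:
gemm_gain = total_speedup(MMVQ_FRACTION, 1.25)
print(f"1.25x faster GEMM  -> {100 * (gemm_gain - 1):.2f}% faster decode")
```

Plugging in the numbers: doubling attention speed yields about a 0.3% overall gain, while a 1.25x GEMM yields about 20%, which matches the ~4-5% vs ~25% spread observed when swapping KV vs weight quantization types.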