ncu profiling reveals whether the MMVQ weight GEMM kernel has optimization headroom
- The MMVQ kernel runs at 88-94% of peak DRAM bandwidth for the large layers (17408, 12288, and 5120 rows). Only the small 1024-row layers sit at ~50%, due to tail effects: not enough work to fill all SMs.
- No fp16 intermediate buffer: the current kernel already dequantizes in registers and accumulates with DP4A (packed 8-bit integer dot product). 40 registers/thread, 46 active warps.
- ExLlamaV3's reported 85% advantage over llama.cpp comes from full-stack optimization (persistent kernels, ops fused across layers, custom memory management), not from a better weight GEMM kernel.
- Per-kernel headroom for the weight GEMM on RTX 3090 is 0-5% at best. HARDWARE WALL REACHED for single-kernel decode speed optimization.