ncu profiling reveals whether the MMVQ weight GEMM kernel has optimization headroom
- The MMVQ kernel runs at 88-94% of peak DRAM bandwidth for the large layers (17408, 12288, and 5120 rows). Only the small 1024-row layers sit at ~50%, due to tail effects: not enough work to fill all SMs.
- No fp16 intermediate buffer: the current kernel already dequantizes in registers and accumulates with DP4A (packed 8-bit integer dot product). 40 registers/thread, 46 active warps.
- ExLlamaV3's reported 85% advantage over llama.cpp comes from full-stack optimization (persistent kernels, ops fused across layers, custom memory management), not from a better weight GEMM kernel.
- Per-kernel headroom for the weight GEMM on RTX 3090 is 0-5% at best. HARDWARE WALL REACHED for single-kernel decode speed optimization.