Loading the 2KB TCQ codebook from __constant__ memory into shared memory gives 32-bank parallel access, improving decode throughput.
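A minimal sketch of the variant under test, assuming a 512-entry float codebook (2KB) and 128-thread blocks; the names `tcq_codebook`, `decode_kv_shared`, and the `uint16_t` code layout are illustrative, not the actual kernel:

```cuda
#include <stdint.h>

#define CODEBOOK_ENTRIES 512               // 512 * 4 B = 2 KB codebook (assumed layout)

__constant__ float tcq_codebook[CODEBOOK_ENTRIES];

// Tested variant: stage the codebook from __constant__ into __shared__ memory
// once per block, then dequantize from the shared copy.
__global__ void decode_kv_shared(const uint16_t* __restrict__ codes,
                                 float* __restrict__ out, int n)
{
    __shared__ float cb[CODEBOOK_ENTRIES];

    // Cooperative copy: each of the 128 threads loads a strided slice.
    for (int i = threadIdx.x; i < CODEBOOK_ENTRIES; i += blockDim.x)
        cb[i] = tcq_codebook[i];
    __syncthreads();

    // Grid-stride dequantization loop: each code indexes the shared copy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = cb[codes[i]];
}
```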
0% improvement. __constant__ cache is NOT a bottleneck: the 2KB codebook fits comfortably in the 64KB __constant__ space and stays resident in the per-SM constant cache on Ampere. 128 threads hitting different entries do NOT serialize as feared (the 32B broadcast granularity is sufficient for 4-byte floats when the access pattern has temporal locality). Confirms the finding from competitor (TheTom): the bottleneck is HOW MANY values are dequantized, not HOW. For anyone implementing codebook-based KV quantization, constant memory is optimal for codebooks ≤64KB.
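For contrast, a standalone sketch of the recommended path under the same assumed 512-entry layout: index the __constant__ table directly in the dequantization loop, with no shared-memory staging.

```cuda
#include <stdint.h>

#define CODEBOOK_ENTRIES 512               // same assumed 2 KB codebook layout

__constant__ float tcq_codebook[CODEBOOK_ENTRIES];

// Recommended path: direct __constant__ lookup per code; the 2 KB table
// remains resident in the constant cache, so no staging step is needed.
__global__ void decode_kv_const(const uint16_t* __restrict__ codes,
                                float* __restrict__ out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = tcq_codebook[codes[i]];
}
```

The two kernels differ only in where the lookup lands (shared vs constant); per the result above, the choice makes no measurable difference for a 2KB table, so the simpler constant-memory version is preferable.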