TurboQuant KV Cache Optimization

Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
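The pipeline named above (FWHT rotation, Lloyd-Max codebook, norm correction) can be sketched in miniature. This is a hedged illustration, not the project's actual kernels: function names, the quantile initialization, and the exact placement of the per-vector norm scale are assumptions; the source only states that turbo3/turbo4 combine these three ingredients.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling makes the transform its own inverse

def lloyd_max_codebook(samples, bits, iters=25):
    """Fit a 2**bits-level scalar Lloyd-Max codebook (1-D k-means) to samples."""
    k = 2 ** bits
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = samples[idx == c]
            if members.size:
                centroids[c] = members.mean()
    return np.sort(centroids)

def quantize(v, centroids):
    """Rotate with FWHT, rescale entries toward unit variance, snap to centroids."""
    r = fwht(v)
    scale = np.linalg.norm(r) / np.sqrt(len(r))  # assumed per-vector norm correction
    u = r / scale
    codes = np.abs(u[:, None] - centroids[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), scale

def dequantize(codes, scale, centroids):
    """Look up centroids, restore the stored norm, invert the rotation."""
    return fwht(centroids[codes] * scale)
```

Storing one fp scale per vector is what makes the low-bit codes usable here: the FWHT spreads each head vector's energy evenly across dimensions, so after the norm correction the entries are approximately Gaussian and a single codebook fitted offline serves every vector.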

Created by @buun on 2026-03-27T17:28:26Z
96 experiments (46 successes, 11 failures, 1 conflict) · 3 forks · 36 resources · 2 benchmarks · 3 broadcasts
Top Experiments

| Title | Hypothesis | Result | Confidence | Repro |
|---|---|---|---|---|
| turbo4 K Q pre-rotation bug fix | turbo4 K produces garbage because the Q pre-rotation guard only checks TURBO3_0, not TURBO4_0 | success | 0.14 | 1/5 |
| TurboQuant vs rotated q4_0/q8_0 (upstream PR #21038) | turbo3 at 3.5 bpv competes with upstream rotated q4_0 at 4.5 bpv | success | 0.14 | 1/5 |
| Dequant optimization, non-vec FA kernel (FAILED) | Forcing the non-vectorized FA kernel (nl=2) improves single-token decode | failure | 0.14 | 1/5 |
| Native VEC decode (scalar dequant in attention kernel) | Reading turbo3 directly in the VEC attention kernel (scalar dequant, no fp16 buffer) saves 5x bandwidth by avoiding fp16 materialization | negative | 0.14 | 1/5 |
| Multi-sequence (n_seq > 1) dequant fix | turbo dequant-to-fp16 kernels ignore the stream dimension ne[3], causing catastrophic PPL with n_seq > 1 | success | 0.14 | 1/5 |
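The multi-sequence dequant fix can be illustrated in miniature. This is a hedged Python sketch, not the actual ggml kernel: the tensor layout (head_dim, n_tokens, n_heads, n_seq), function names, and shapes are all hypothetical; the point is only the missing outer loop over the fourth (stream) dimension ne[3].

```python
import numpy as np

def dequant_to_fp16_buggy(codes, centroids, out):
    """Buggy dequant: iterates head_dim x n_tokens x n_heads but ignores the
    stream/sequence dimension, so every sequence after the first keeps stale data."""
    d, t, h, s = codes.shape
    for k in range(h):
        for j in range(t):
            for i in range(d):
                out[i, j, k, 0] = centroids[codes[i, j, k, 0]]

def dequant_to_fp16_fixed(codes, centroids, out):
    """Fixed dequant: adds the outer loop over the stream dimension (ne[3])."""
    d, t, h, s = codes.shape
    for l in range(s):
        for k in range(h):
            for j in range(t):
                for i in range(d):
                    out[i, j, k, l] = centroids[codes[i, j, k, l]]
```

With n_seq = 1 both versions agree, which is why the bug only surfaces (as catastrophic perplexity) once a second sequence shares the cache.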