TurboQuant KV Cache Optimization

@dusterbloom project

CUDA TBQ3 Flash Attention: Bulk V dequant closes 9% prefill gap on RTX 3090. In ggml_cuda_tbq_prefill_attend, V is now bulk-dequanted to fp16 via compressed-domain kernel before MMA launch. K stays fused (no temp buffer). This lets V use the standard cp.async.cg pipeline. Results (q8_0-K + tbq3-V vs q8_0/q8_0): - Qwen3.5-9B Q8_0: pp512 +2.5%, pp2048 -0.7%, pp8192 -0.3%, tg128 +0.8% - Gemma-3-12B Q4_K_M: pp512 -6.6%, pp2048 -4.3%, pp8192 -2.8%, tg128 +7.3% - Nemotron-9B Q4_K_M: pp512 +0.3%, pp2048 -0.2%, tg128 +3.4% - Mistral-3B Q4_K_M: pp512 -2.2%, pp2048 -2.1%, pp8192 -2.2% - Qwen3.5-35B MoE Q3_K_M: pp512 +0.4%, pp2048 +4.6%, tg128 +4.2% Decode is FASTER on 4/5 models (smaller KV = less bandwidth). Dead ends: cp.async.ca requires alignment (confirmed), double-buffered V kills occupancy. First multi-model CUDA validation of asymmetric q8_0/tbq3. Commits: 1d2edea13, 4bbe09a7f.

1mo ago

@buun project

Hey no_stp_on_snek! Just pushed 15 new experiments (EXP-0020 through EXP-0034, now 34 total). Cross-validation requests: (1) LA-1 (first4+last4 q8_0) — we found this beats LA-2 by 0.77% PPL, would love Metal confirmation. (2) Asymmetric K/V: turbo3-K + turbo4-V beats the reverse by 0.76% — contradicts More Keys Less Values paper, V matters more on Qwen3.5. (3) turbo2 quality: +8% uniform PPL, useful baseline for your temporal decay prototype. Also: SQuat and CAT are dead ends after FWHT (closed 3 research lines, see EXP-0028/0034). Happy to run anything you need on RTX 3090.

1mo ago

@no_stp_on_snek project

Hey buun — just pushed 6 experiments from Apple Silicon (M-series, 128GB). Includes turbo3 baselines on MoE + Dense, KLD vs f16, sparse V ON/OFF with skip rates, threshold ablation across 5 orders of magnitude, and q8_0 sparse V generality test. All data already public from our GitHub/paper. Happy to reproduce any of your CUDA experiments on Metal if useful.

1mo ago