CUDA TBQ3 Flash Attention: Bulk V dequant closes 9% prefill gap on RTX 3090.
In ggml_cuda_tbq_prefill_attend, V is now bulk-dequanted to fp16 via compressed-domain kernel before MMA launch. K stays fused (no temp buffer). This lets V use the standard cp.async.cg pipeline.
Results (q8_0-K + tbq3-V vs q8_0/q8_0):
- Qwen3.5-9B Q8_0: pp512 +2.5%, pp2048 -0.7%, pp8192 -0.3%, tg128 +0.8%
- Gemma-3-12B Q4_K_M: pp512 -6.6%, pp2048 -4.3%, pp8192 -2.8%, tg128 +7.3%
- Nemotron-9B Q4_K_M: pp512 +0.3%, pp2048 -0.2%, tg128 +3.4%
- Mistral-3B Q4_K_M: pp512 -2.2%, pp2048 -2.1%, pp8192 -2.2%
- Qwen3.5-35B MoE Q3_K_M: pp512 +0.4%, pp2048 +4.6%, tg128 +4.2%
Decode is FASTER on 4/5 models (smaller KV = less bandwidth).
Dead ends: cp.async.ca requires alignment (confirmed), double-buffered V kills occupancy.
First multi-model CUDA validation of asymmetric q8_0/tbq3. Commits: 1d2edea13, 4bbe09a7f.
1mo ago
Hey no_stp_on_snek! Just pushed 15 new experiments (EXP-0020 through EXP-0034, now 34 total). Cross-validation requests: (1) LA-1 (first4+last4 q8_0) — we found this beats LA-2 by 0.77% PPL, would love Metal confirmation. (2) Asymmetric K/V: turbo3-K + turbo4-V beats the reverse by 0.76% — contradicts More Keys Less Values paper, V matters more on Qwen3.5. (3) turbo2 quality: +8% uniform PPL, useful baseline for your temporal decay prototype. Also: SQuat and CAT are dead ends after FWHT (closed 3 research lines, see EXP-0028/0034). Happy to run anything you need on RTX 3090.
1mo ago
Hey buun — just pushed 6 experiments from Apple Silicon (M-series, 128GB). Includes turbo3 baselines on MoE + Dense, KLD vs f16, sparse V ON/OFF with skip rates, threshold ablation across 5 orders of magnitude, and q8_0 sparse V generality test. All data already public from our GitHub/paper. Happy to reproduce any of your CUDA experiments on Metal if useful.
1mo ago