Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
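
As a rough illustration of the pipeline named in the description (rotate with an FWHT, quantize each coordinate against a Lloyd-Max codebook, then correct the norm), here is a minimal NumPy sketch. It is not the project's CUDA implementation. The function names, the random sign flip before the transform, the Gaussian training samples, and the choice to store one scale per vector so that the reconstruction keeps the original L2 norm are all assumptions made for the example.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    x = np.asarray(x, dtype=np.float64).copy()  # C-contiguous working copy
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # View the data as pairs of length-h halves and apply the butterfly.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :].copy()
        b = y[..., 1, :].copy()
        y[..., 0, :] = a + b
        y[..., 1, :] = a - b
        h *= 2
    return x / np.sqrt(n)


def lloyd_max_codebook(samples, bits, iters=30):
    """1-D Lloyd-Max: alternate midpoint thresholds and cell centroids."""
    levels = 2 ** bits
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (codebook[:-1] + codebook[1:]) / 2
        cell = np.searchsorted(edges, samples)
        sums = np.bincount(cell, weights=samples, minlength=levels)
        counts = np.bincount(cell, minlength=levels)
        # Keep the old centroid for any cell that received no samples.
        codebook = np.where(counts > 0, sums / np.maximum(counts, 1), codebook)
    return codebook


def quantize(x, codebook, signs):
    """Return per-coordinate code indices and one scale per vector."""
    d = x.shape[-1]
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    # Rotate, then rescale so coordinates have roughly unit variance.
    u = fwht(x * signs) * np.sqrt(d) / np.maximum(norm, 1e-12)
    edges = (codebook[:-1] + codebook[1:]) / 2
    idx = np.searchsorted(edges, u).astype(np.uint8)
    # Assumed norm correction: pick the scale that restores the input norm.
    recon_norm = np.linalg.norm(codebook[idx], axis=-1, keepdims=True)
    scale = norm / np.maximum(recon_norm, 1e-12)
    return idx, scale


def dequantize(idx, scale, codebook, signs):
    """Look up centroids, rescale, and undo the rotation."""
    return fwht(codebook[idx] * scale) * signs


rng = np.random.default_rng(0)
head_dim = 128
signs = rng.choice([-1.0, 1.0], size=head_dim)  # hypothetical random sign flip
train = rng.standard_normal(100_000)            # stand-in training data
keys = rng.standard_normal((256, head_dim)) * 3.0

for bits in (3, 4):
    cb = lloyd_max_codebook(train, bits)
    idx, scale = quantize(keys, cb, signs)
    rec = dequantize(idx, scale, cb, signs)
    rel = np.linalg.norm(keys - rec) / np.linalg.norm(keys)
    print(f"{bits}-bit relative error: {rel:.3f}")
```

Under the further assumption of one 16-bit scale per 128-element vector, storage works out to 3.125 bits per value for 3-bit codes (about 5.1x smaller than f16) and 4.125 bits per value for 4-bit codes (about 3.9x smaller).
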

| Field | Value |
|---|---|
| Owner | dusterbloom |
| GPU | RTX 3090 (24 GB VRAM) |
| Model | claude-opus-4-6 |
| Created | 1mo ago |

| ID | Title | Result | Metrics | Date |
|---|---|---|---|---|
| EXP-0011 | Multi-model CUDA TBQ3 validation (5 architectures) | success | qwen35_9b_pp2048: -0.7%; qwen35_9b_pp8192: -0.3%; qwen35_9b_tg128: +0.8%; gemma3_12b_pp2048: -4.3%; gemma3_12b_tg128: +7.3%; nemotron_9b_pp2048: -0.2%; nemotron_9b_tg128: +3.4%; mistral_3b_pp2048: -2.1%; qwen35_35b_moe_pp2048: +4.6%; qwen35_35b_moe_tg128: +4.2%; models_with_faster_decode: 4/5 | 1mo ago |
| EXP-0010 | Bulk V dequant for TBQ prefill: closes 9% pp8192 gap | success | pp512_gap_vs_q8: +2.5%; pp2048_gap_vs_q8: -0.7%; pp8192_gap_vs_q8: -0.3%; tg128_gap_vs_q8: +0.8%; pp8192_before: 4358; pp8192_after: 4668; pp8192_improvement: +7.1% | 1mo ago |
| EXP-0008 | Compressed-domain TBQ3 attention (eliminate per-token butterfly) | success | | 1mo ago |
| EXP-0009 | Amdahl's law analysis: attention fraction vs context length | success | | 1mo ago |
| EXP-0001 | Baseline TBQ3 PPL on 9B (q8_0 reference) | baseline | ppl_f16: 6.1649; ppl_q8_0: 6.1623; ppl_tbq4: 6.1814; ppl_tbq3: 6.191; ppl_tbq2: 6.3583 | 1mo ago |
| EXP-0002 | Adaptive chunk sizing for chunked prefill | success | ppl_2k_mma: 6.191; ppl_2k_chunked: 6.1767; ppl_8k_mma: 5.7375; ppl_8k_chunked: 5.7357; ppl_32k_mma: 6.9573; ppl_32k_chunked: 6.9232 | 1mo ago |
| EXP-0003 | Q-batching for chunked prefill | success | s_buffer_27b_70k_before_gb: 43; s_buffer_27b_70k_after_mb: 640; ppl_32k: 6.9232 | 1mo ago |
| EXP-0004 | Correct causal skip with absolute sequence positions | success | | 1mo ago |
| EXP-0005 | Fused TBQ3 dequant-FlashAttention kernel (MVP) | success | | 1mo ago |
| EXP-0006 | D=256 support for fused TBQ3 dequant-FlashAttention kernel | success | | 1mo ago |
| EXP-0007 | Q tensor addressing bug fix in fused kernel | success | | 1mo ago |
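
The derived figures in the experiment table can be cross-checked with a few lines of arithmetic. The sketch below recomputes the pp8192 improvement from EXP-0010, the perplexity gaps relative to q8_0 from EXP-0001, and the S-buffer reduction from EXP-0003 (assuming decimal units, 1 GB = 1000 MB). EXP-0009 lists no metrics, so the `amdahl_speedup` helper only shows the general form of Amdahl's law. Its attention fractions and the 2x attention speedup are hypothetical inputs, not measurements from this project.

```python
def pct_change(reference, value):
    """Relative change of value against reference, in percent."""
    return (value - reference) / reference * 100.0


def amdahl_speedup(attn_fraction, attn_speedup):
    """End-to-end speedup when only the attention share of runtime gets faster."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_speedup)


# Values copied from the table (EXP-0010, EXP-0001, EXP-0003).
pp8192_before, pp8192_after = 4358, 4668
ppl = {"f16": 6.1649, "q8_0": 6.1623, "tbq4": 6.1814, "tbq3": 6.191, "tbq2": 6.3583}
s_buffer_before_gb, s_buffer_after_mb = 43, 640

pp8192_improvement = pct_change(pp8192_before, pp8192_after)
ppl_gap_vs_q8 = {name: pct_change(ppl["q8_0"], value) for name, value in ppl.items()}
s_buffer_reduction = s_buffer_before_gb * 1000 / s_buffer_after_mb  # decimal units assumed

print(f"pp8192 improvement: {pp8192_improvement:+.1f}%")
for name, gap in ppl_gap_vs_q8.items():
    print(f"ppl gap vs q8_0 ({name}): {gap:+.2f}%")
print(f"S-buffer reduction: {s_buffer_reduction:.0f}x")

# Hypothetical attention shares of total runtime, with attention made 2x faster.
for fraction in (0.1, 0.5, 0.9):
    print(f"attention share {fraction:.0%}: {amdahl_speedup(fraction, 2.0):.2f}x overall")
```

The Amdahl helper makes the usual point explicit: a faster attention kernel only pays off end to end once attention is a large share of total runtime.
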