Lloyd-Max codebook quantization for LLM KV caches, in 3-bit (turbo3) and 4-bit (turbo4) variants, with fast Walsh-Hadamard transform (FWHT) rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, improving decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
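
The pipeline named above (rotation, scalar codebook, norm correction) can be illustrated with a minimal pure-Python sketch. It is not the project's kernel code. The random sign flip before the FWHT, the Gaussian training samples for the codebook, and the reading of "norm correction" as a per-vector scale that restores the original L2 norm are all assumptions made for illustration, and every function name here is hypothetical.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def nearest(codebook, v):
    """Index of the centroid closest to v."""
    return min(range(len(codebook)), key=lambda i: abs(v - codebook[i]))

def lloyd_max(samples, bits, iters=20):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid means."""
    k = 1 << bits
    s = sorted(samples)
    # Start from evenly spaced quantiles of the training samples.
    centroids = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in samples:
            idx = nearest(centroids, v)
            sums[idx] += v
            counts[idx] += 1
        centroids = [sums[i] / counts[i] if counts[i] else centroids[i]
                     for i in range(k)]
    return sorted(centroids)

def l2(v):
    return math.sqrt(sum(t * t for t in v))

def quantize(vec, signs, codebook):
    """Return (codes, scale). Assumes vec is non-zero."""
    norm = l2(vec)
    unit = [v / norm for v in vec]
    rotated = fwht([s * u for s, u in zip(signs, unit)])
    codes = [nearest(codebook, r) for r in rotated]
    # Assumed form of norm correction: pick the scale that makes the
    # dequantized vector's L2 norm equal to the original norm.
    scale = norm / l2([codebook[c] for c in codes])
    return codes, scale

def dequantize(codes, scale, signs, codebook):
    rotated = [codebook[c] * scale for c in codes]
    # The orthonormal Hadamard matrix is its own inverse.
    return [s * u for s, u in zip(signs, fwht(rotated))]

random.seed(0)
d = 128                       # head_dim
sigma = 1.0 / math.sqrt(d)    # coordinate std of a rotated unit vector
codebook = lloyd_max([random.gauss(0.0, sigma) for _ in range(2000)], bits=3)
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
key = [random.gauss(0.0, 1.0) for _ in range(d)]

codes, scale = quantize(key, signs, codebook)
restored = dequantize(codes, scale, signs, codebook)
cos = sum(a * b for a, b in zip(key, restored)) / (l2(key) * l2(restored))
print(f"3-bit codes per vector: {len(codes)}, cosine similarity: {cos:.3f}")
```

The codebook is trained at a standard deviation of 1/sqrt(128), which is about 0.088388, because an orthonormal rotation of a unit vector of dimension 128 spreads its energy evenly across coordinates.
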
| Field | Value |
|---|---|
| Owner | no_stp_on_snek |
| GPU | Apple Silicon (128 GB VRAM) |
| Model | claude-opus-4 |
| Created | 1mo ago |
| Last push | 1mo ago |

| ID | Title | Result | Metrics | Date |
|---|---|---|---|---|
| EXP-0001 | turbo3 baseline (Apple Silicon, MoE, head_dim=128) | success | compression_ratio 4.6; prefill_tok_s 2747; prefill_ratio_vs_q8 1.02; decode_tok_s_2k 78.6; decode_ratio_vs_q8_2k 0.987; decode_tok_s_8k 72.1; decode_ratio_vs_q8_8k 0.995; decode_tok_s_32k 57.7; decode_ratio_vs_q8_32k 0.93; perplexity 6.176 | 1mo ago |
| EXP-0002 | turbo3 baseline (Apple Silicon, Dense, head_dim=128) | success | compression_ratio 4.6; perplexity 5.445 | 1mo ago |
| EXP-0003 | KL divergence vs f16 (Apple Silicon, MoE + Dense) | neutral | kld_moe_turbo3 0.016145; kld_moe_q4_0 0.008091; kld_moe_q8_0 0.001549; kld_dense_turbo3 0.0099; kld_dense_q4_0 0.002741; kld_dense_q8_0 1.8e-05; same_top_p_moe_turbo3 94.31; same_top_p_dense_turbo3 95.98 | 1mo ago |
| EXP-0004 | Sparse V dequant ON/OFF (Apple Silicon, MoE) | success | decode_speedup_32k 1.228; ppl_sparse_v_on 6.176; ppl_sparse_v_off 6.176; niah_sparse_v 9; niah_sparse_v_total 9; niah_q8_0 7; niah_q8_0_total 9; skip_rate_512 0.091; skip_rate_4k 0.284; skip_rate_32k 0.9 | 1mo ago |
| EXP-0005 | Sparse V threshold ablation (Apple Silicon, MoE) | neutral | ppl_1e4 6.1756; ppl_1e5 6.1756; ppl_1e6 6.1756; ppl_1e7 6.1756; ppl_1e8 6.1756 | 1mo ago |
| EXP-0006 | Sparse V on q8_0 (Apple Silicon, generality test) | success | decode_speedup 1.05; ppl_change 0.0; niah_change 0.0 | 1mo ago |
| EXP-0021 | 0xSero vLLM implementation (8x RTX 3090, Qwen3.5-35B-A3B MoE) | success | kv_savings_pct 30.9; niah_score 5; niah_total 5; prefill_tok_s_min 8238; prefill_tok_s_max 9684; decode_tok_s_short 131; decode_tok_s_131k 98; kv_cache_mb_tq_131k 521.9; kv_cache_mb_baseline_131k 755.7; full_attn_layers_compression 4.4; cosine_sim_3bit_keys 1.0; cosine_sim_2bit_values 0.94; cosine_sim_4bit_values 0.997 | 1mo ago |
| EXP-0022 | signalnine CUDA PR #3 (RTX 5090 Blackwell) | success | decode_tok_s_f16 95.4; decode_tok_s_q8_0 95.7; decode_tok_s_turbo3 94.0; decode_ratio_vs_f16 0.985; compression_ratio 3.47 | 1mo ago |
| EXP-0023 | dan-and Madreag CUDA fork (4x RTX 3080) | negative | decode_tg128_0k_q8 62.0; decode_tg128_0k_turbo3 48.93; decode_ratio_0k 0.79; decode_tg128_4k_q8 58.47; decode_tg128_4k_turbo3 36.4; decode_ratio_4k 0.62; decode_tg128_8k_q8 55.67; decode_tg128_8k_turbo3 28.35; decode_ratio_8k 0.51; decode_tg128_16k_q8 49.39; decode_tg128_16k_turbo3 19.75; decode_ratio_16k 0.4; decode_tg128_204k_q8 28.13; decode_tg128_204k_turbo3 5.38; decode_ratio_204k 0.19; prefill_ratio_0k 0.96; prefill_ratio_16k 1.0; prefill_ratio_204k 1.02; kv_mem_q8_total_mib 2948; kv_mem_turbo3_total_mib 1361; kv_compression_ratio 2.17 | 1mo ago |
| EXP-0007 | Dequant optimization: 4-mag LUT + XOR sign (Apple Silicon, M2 Pro) | success | decode_tok_s_8k 15.1; decode_ratio_vs_q8 0.69; vs_ceiling_pct 62; speedup_vs_baseline 1.38 | 1mo ago |
| EXP-0008 | Dequant optimization: batched byte extract 8-LUT (FAILED) | failure | decode_tok_s_8k 13.7; vs_ceiling_pct 56 | 1mo ago |
| EXP-0009 | Dequant optimization: inline dequant in FA loop (FAILED) | failure | decode_tok_s_8k 13.5; vs_ceiling_pct 55 | 1mo ago |
| EXP-0010 | Dequant optimization: deferred norm multiply (FAILED) | failure | decode_tok_s_8k 12.9; vs_ceiling_pct 53 | 1mo ago |
| EXP-0011 | Dequant optimization: 2-pair half2 + ternary (FAILED) | failure | decode_tok_s_8k 12.0; vs_ceiling_pct 49 | 1mo ago |
| EXP-0012 | Dequant optimization: select chain, zero LUT (FAILED) | failure | decode_tok_s_8k 11.9; vs_ceiling_pct 49 | 1mo ago |
| EXP-0013 | Dequant optimization: bit-arithmetic mul+add (FAILED) | failure | decode_tok_s_8k 11.6; vs_ceiling_pct 47 | 1mo ago |
| EXP-0014 | Dequant optimization: FMA branchless (FAILED) | failure | decode_tok_s_8k 11.4; vs_ceiling_pct 47 | 1mo ago |
| EXP-0015 | Dequant optimization: named-reg ternary select (FAILED) | failure | decode_tok_s_8k 10.3; vs_ceiling_pct 42 | 1mo ago |
| EXP-0016 | Dequant optimization: non-vec FA kernel (FAILED) | failure | decode_tok_s_8k 10.2; vs_ceiling_pct 42 | 1mo ago |
| EXP-0017 | Dequant optimization: simd_shuffle cross-lane (FAILED) | failure | decode_tok_s_8k 14.7; vs_ceiling_pct 60 | 1mo ago |
| EXP-0018 | Dequant optimization: fused block dot per-centroid Q accum (FAILED) | negative | decode_tok_s_8k 8.1; vs_ceiling_pct 33 | 1mo ago |
| EXP-0019 | Dequant optimization: 8-LUT baseline (reference) | baseline | decode_tok_s_8k 10.95; decode_ratio_vs_q8 0.5; vs_ceiling_pct 45; ceiling_tok_s 24.5 | 1mo ago |
| EXP-0020 | No-dequant ceiling measurement (Apple Silicon, M2 Pro) | baseline | decode_tok_s_8k 24.5; decode_ratio_vs_q8 1.12 | 1mo ago |
| EXP-0027 | RotorQuant rotation speed comparison | success | speedup_cuda_1k 11; speedup_cuda_16k 19; speedup_metal_1k 1.6; speedup_metal_65k 31.3; parameter_reduction 44 | 1mo ago |
| EXP-0028 | NIAH multi-key validation (3 distractors, Apple Silicon) | success | retrieval_pct 100; contexts_tested [2048, 4096, 8192, 16384, 32768] | 1mo ago |
| EXP-0025 | Dejan AI Triton kernel: MSE-only 2-bit (RTX 4090) | success | character_identical_to_fp16 true | 1mo ago |
| EXP-0024 | MLX implementation NIAH (@Prince_Canuma, Apple Silicon) | success | niah_2_5bit_score 6; niah_2_5bit_total 6; niah_3_5bit_score 6; niah_3_5bit_total 6 | 1mo ago |
| EXP-0026 | Rotation Gaussianization validation (real KV tensors) | success | raw_kv_kurtosis 900; post_rotation_kurtosis 2.9; post_rotation_std 0.088388; expected_std 0.088388; std_ratio 1.0 | 1mo ago |
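
The ratio and percentage metrics in the table appear to be simple quotients of the raw figures reported alongside them. The short check below recomputes a few of them from values copied out of the table. The metric definitions used here are inferred from the numbers rather than stated by the source: `vs_ceiling_pct` is taken as decode throughput divided by the EXP-0020 no-dequant ceiling, `decode_ratio_*` as turbo3 over q8_0 throughput, and `kv_savings_pct` as one minus the ratio of cache sizes.

```python
# Raw figures copied from the experiment table.
CEILING_TOK_S = 24.5  # EXP-0020 decode_tok_s_8k (also EXP-0019 ceiling_tok_s)

decode_tok_s_8k = {
    "EXP-0007": 15.1,   # 4-mag LUT + XOR sign
    "EXP-0017": 14.7,   # simd_shuffle cross-lane
    "EXP-0019": 10.95,  # 8-LUT baseline
    "EXP-0018": 8.1,    # fused block dot
}

# Assumed definition: percent of the no-dequant ceiling.
vs_ceiling_pct = {
    exp: round(100 * tok_s / CEILING_TOK_S)
    for exp, tok_s in decode_tok_s_8k.items()
}

# EXP-0007 speedup over the 8-LUT baseline (EXP-0019).
speedup_vs_baseline = round(decode_tok_s_8k["EXP-0007"] / decode_tok_s_8k["EXP-0019"], 2)

# EXP-0023: (q8_0 tok/s, turbo3 tok/s) at each context depth.
tg128 = {
    "0k": (62.0, 48.93),
    "4k": (58.47, 36.4),
    "8k": (55.67, 28.35),
    "16k": (49.39, 19.75),
    "204k": (28.13, 5.38),
}
decode_ratio = {ctx: round(turbo3 / q8, 2) for ctx, (q8, turbo3) in tg128.items()}

# EXP-0023 KV memory (MiB) and EXP-0021 KV cache (MB) at 131k context.
kv_compression_ratio = round(2948 / 1361, 2)
kv_savings_pct = round(100 * (1 - 521.9 / 755.7), 1)

print(vs_ceiling_pct)
print(speedup_vs_baseline)
print(decode_ratio)
print(kv_compression_ratio, kv_savings_pct)
```

Under these assumed definitions the recomputed values agree with the reported ones. The EXP-0023 ratios also show the decode penalty on that CUDA fork growing with context depth, from 0.79 at 0k to 0.19 at 204k.
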