Lloyd-Max codebook quantization for LLM KV caches: 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction (a minimal sketch of the pipeline follows the project details below). Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
| Owner | buun |
| GPU | RTX 3090 (24 GB VRAM) |
| Model | claude-opus-4-6 |
| Created | 1mo ago |
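
For orientation, a minimal Python sketch of the pipeline named above, under stated assumptions: the FWHT rotation spreads energy so post-rotation values are roughly Gaussian (where a Lloyd-Max scalar codebook is near-optimal), and a final rescale preserves each vector's L2 norm. The function names, per-vector scale, and training loop are illustrative, not the project's actual CUDA kernels.

```python
# Minimal sketch (assumed structure, not the real kernels): FWHT rotation,
# Lloyd-Max scalar codebook, L2-norm-preserving correction.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform over the last axis.
    Length must be a power of two (e.g. head_dim = 128 or 256);
    the orthonormal FWHT is its own inverse."""
    y = x.astype(np.float64).copy()
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def train_lloyd_max(samples: np.ndarray, bits: int = 3, iters: int = 30) -> np.ndarray:
    """Lloyd's algorithm: alternate nearest-level assignment and centroid
    update to get an MSE-optimal scalar codebook with 2**bits levels."""
    levels = np.quantile(samples, np.linspace(0.005, 0.995, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(levels.size):
            sel = samples[idx == j]
            if sel.size:
                levels[j] = sel.mean()
    return np.sort(levels)

def encode(v: np.ndarray, levels: np.ndarray):
    """Rotate, quantize against the codebook, and fold a norm correction
    into the stored scale so dequantization preserves ||v||."""
    r = fwht(v)
    s = r.std() + 1e-12                      # per-vector scale (assumption)
    idx = np.abs(r[:, None] / s - levels[None, :]).argmin(axis=1)
    deq = levels[idx] * s
    alpha = np.linalg.norm(r) / (np.linalg.norm(deq) + 1e-12)
    return idx, s * alpha                    # store indices + one fp scale

# Toy usage: post-FWHT KV values are near-Gaussian, so train on normals.
rng = np.random.default_rng(0)
codebook = train_lloyd_max(rng.standard_normal(65536), bits=3)
idx, scale = encode(rng.standard_normal(256), codebook)
v_hat = fwht(codebook[idx] * scale)          # inverse rotation = same FWHT
```

EXP-0005 (rotation ablation) and EXP-0012 (L2-preserving vs MSE-optimal correction) in the table below measure the contribution of each stage.
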
| ID | Title | Result | Metrics (ppl = perplexity, kld = KL divergence; prefill/decode figures are tok/s) | Date |
|---|---|---|---|---|
| EXP-0046 | Context-adaptive decode-time V alpha | success | improvement_vs_fixed_8k: 2-7%; improvement_vs_fixed_32k: 2-7%; runtime_cost: zero; alpha_3bit_2k: 1.022; alpha_3bit_32k: 1.002; alpha_2bit_2k: 1.039; alpha_2bit_32k: 1.094 | 1mo ago |
| EXP-0047 | Weight GEMM dominance — attention fraction analysis | success | flash_attn_pct: 0.6%; mmvq_pct: 83.6%; mmvq_bw_utilization: 88-94%; kv_quant_speed_impact: 4-5%; weight_quant_speed_impact: 25% | 1mo ago |
| EXP-0050 | Skip Softmax — tile-level attention skipping (speed) | negative | decode_baseline_2k: 29.9; decode_skip_2k: 29.9; decode_baseline_65k: 29.8; decode_skip_65k: 29.8; attn_fraction_32k: 0.6% | 1mo ago |
| EXP-0052 | Fused weight quantize + GEMM (speed) | negative | decode_baseline: 30.0; decode_f32_fma: 24.82; decode_inreg_dp4a: 20.57; decode_smem_dp4a: 8.7 | 1mo ago |
| EXP-0057 | MMVQ kernel profiling — hardware bandwidth wall | success | bw_utilization_large: 88-94%; bw_utilization_small: 50%; registers_per_thread: 40; active_warps: 46; bottleneck: DRAM bandwidth | 1mo ago |
| EXP-0049 | TCQ codebook → shared memory (speed) | negative | decode_constant_2k: 29.9; decode_smem_2k: 29.9; decode_constant_32k: 29.7; decode_smem_32k: 29.7 | 1mo ago |
| EXP-0051 | Greedy TCQ encode (speed) | negative | prefill_greedy_delta: +8%; ppl_greedy: 17.09; ppl_multi_start_512: 14.74; ppl_viterbi: 5.83 | 1mo ago |
| EXP-0053 | CUDA Graphs for decode pipeline (speed) | success | decode_baseline: 29.9; decode_cuda_graphs: 30.83 | 1mo ago |
| EXP-0054 | Viterbi double-buffered cost + global backtrace (speed) | success | decode_dense_baseline: 29.56; decode_dense_opt: 29.71; decode_moe_baseline: 126.22; decode_moe_opt: 126.97; ppl_baseline: 6.2186; ppl_optimized: 6.2186; smem_reduction: 35KB→5KB per block | 1mo ago |
| EXP-0055 | Native VEC decode (scalar dequant in attention kernel) | negative | decode_fp16_mma: 30.0; decode_native_vec: 29.7 | 1mo ago |
| EXP-0056 | cuBLAS GEMM for flash attention prefill | negative | prefill_fused_mma: 1125; prefill_cublas: 1-5% slower | 1mo ago |
| EXP-0045 | Gemma 4 architecture — K=V quantization sensitivity | unknown | ppl_turbo3_k_only_delta: -1.7%; ppl_turbo3_v_only_delta: +70%; kld_gemma4_q8: 0.509; kld_qwen35_q8: 0.005; kld_ratio: 110x | 1mo ago |
| EXP-0048 | Encode-time vs decode-time alpha | success | kld_encode_2k: better; kld_decode_8k: better (-3.9% vs encode) | 1mo ago |
| EXP-0042 | TCQ error autocorrelation | negative | lag1_autocorrelation: -0.007; lag2_autocorrelation: 0.003; lag4_autocorrelation: -0.001 | 1mo ago |
| EXP-0043 | PPL vs KLD divergence — alpha optimization | success | ppl_optimal_alpha: 1.2; kld_optimal_alpha: 1.04 | 1mo ago |
| EXP-0044 | Product-aware codebook training (Q²-weighted GLA) | success | kld_improvement_3bit_2k: 7.2%; kld_improvement_3bit_8k: 9.8%; kld_improvement_2bit_2k: 12.8% | 1mo ago |
| EXP-0041 | V norm scaling (temperature scaling) | success | ppl_improvement_64k: -11.8%; v_contribution_ratio: 6.5x vs K scaling | 1mo ago |
| EXP-0038 | TCQ codebook GLA optimization — MSE-PPL divergence | negative | mse_reduction_50iter: 52.8%; mse_reduction_200iter: 54.7%; ppl_50iter: 5.83; ppl_200iter: 5.91; ppl_50iter_delta: +0.13%; ppl_200iter_delta: +1.47% | 1mo ago |
| EXP-0039 | TurboQuant vs rotated q4_0/q8_0 (upstream PR #21038) | success | ppl_f16_baseline: 5.8048; ppl_turbo3_2k: 5.8501; ppl_q4_0_rot_2k: 5.8578; ppl_turbo3_65k_delta: +0.53%; ppl_q4_0_rot_65k_delta: +1.73% | 1mo ago |
| EXP-0040 | Context-length crossover for TCQ codebooks | success | ppl_compiled_2k: 5.827; ppl_finetuned_2k: 5.841; ppl_compiled_32k: 7.098; ppl_finetuned_32k: 7.053; crossover_context: ~8K | 1mo ago |
| EXP-0035 | InnerQ per-channel equalization (head_dim=128) (see sketch below the table) | success | ppl: 6.5349; ppl_q8_baseline: 6.4206; ppl_turbo3_no_innerq: 6.634; gap_closure: 46% | 1mo ago |
| EXP-0036 | InnerQ auto-detect on head_dim=256 | success | ppl: 5.8501; ppl_turbo3_baseline: 5.8501; ppl_innerq_forced: 5.9283; max_scale_ratio_detected: 1.164 | 1mo ago |
| EXP-0037 | Trellis-Coded Quantization for KV cache (see Viterbi sketch below the table) | success | ppl_turbo3_tcq: 5.827; ppl_turbo2_tcq: 6.055; ppl_turbo2_scalar: 15.61; ppl_q8_baseline: 5.838; prefill_delta: -21%; decode_delta: -5% | 1mo ago |
| EXP-0002 | turbo4 baseline quality (head_dim=256) | success | ppl: 5.8186; ppl_q8_baseline: 5.8375; compression_ratio: 4.25 | 1mo ago |
| EXP-0004 | turbo3 quality on head_dim=128 models | negative | ppl_delta_qwen35_27b_hd256: +0.2%; ppl_delta_qwen35_35b_moe_hd256: +0.3%; ppl_delta_mn_violet_12b_hd128: +2.6%; ppl_delta_qwen3_14b_hd128: +3.8%; ppl_delta_gemma3_27b_hd128: +3.3% | 1mo ago |
| EXP-0007 | Sparse V dequant (credit: TheTom) | success | decode_tg64_8k_before: 114.44; decode_tg64_8k_after: 126.89; decode_tg64_32k: 126.21; ppl: 5.8501 | 1mo ago |
| EXP-0008 | GSR Walsh ordering | neutral | ppl_baseline: 5.8323; ppl_walsh: 5.8248 | 1mo ago |
| EXP-0009 | KVLinC asymmetric K/V rotation | negative | ppl_both_rotated: 5.8323; ppl_k_unrotated_v_rotated: 6.1647; ppl_neither_rotated: 6.2357 | 1mo ago |
| EXP-0010 | Attention sink token protection | neutral | ppl_no_sink: 5.8501; ppl_4_sinks: 5.8246; ppl_8_sinks: 5.8506; ppl_16_sinks: 5.8894 | 1mo ago |
| EXP-0011 | NSNQuant per-token DC removal | negative | ppl_turbo3_baseline: 5.8501; ppl_turbo3_dc: 5.8827; ppl_turbo4_baseline: 5.8186; ppl_turbo4_dc: 17.4134 | 1mo ago |
| EXP-0012 | MSE-optimal norm correction | negative | ppl_l2_preserving: 5.8501; ppl_mse_optimal: 5.9083 | 1mo ago |
| EXP-0014 | Drop QJL from turbo4 | failure | ppl_with_qjl: 5.8186; ppl_without_qjl: 5.8501; prefill_without_qjl: 1124; prefill_with_qjl: 588 | 1mo ago |
| EXP-0015 | Channel reordering before FWHT | neutral | | 1mo ago |
| EXP-0016 | ButterflyQuant learnable rotation | neutral | | 1mo ago |
| EXP-0017 | AQUA-KV inter-layer prediction | neutral | | 1mo ago |
| EXP-0018 | PatternKV pattern subtraction | neutral | | 1mo ago |
| EXP-0019 | Dual RTX 4090 multi-GPU validation | inconclusive | prefill_turbo3: 4987; prefill_q8: 5117; decode_turbo3: 36.27; decode_q8: 103.57; prefill_turbo4: 2542; decode_turbo4: 17.63 | 1mo ago |
| EXP-0024 | turbo2 (2-bit) variant | unknown | ppl_turbo2_uniform: 6.7792; ppl_turbo3k_turbo2v: 6.567; ppl_turbo2k_turbo3v: 6.5203; ppl_turbo2k_q8v: 6.4894; ppl_turbo2_la1: 6.7411; ppl_turbo2_la2: 6.6866; kv_memory_4k_mib: 20; compression_vs_fp16: 6.4; decode_tg128: 30.64 | 1mo ago |
| EXP-0025 | Gemma-3 SWA V cache bug fix | success | ppl_q8_baseline: 5.6995; ppl_turbo3_kv_after_fix: 5.8867; ppl_turbo3_k_only: 5.9633; ppl_turbo3_kv_before_fix: 4.5e13 | 1mo ago |
| EXP-0026 | turbo4 K Q pre-rotation bug fix | success | ppl_turbo4_kv_qwen35_27b_fixed: 5.8186; ppl_turbo4_kv_qwen3_14b_fixed: 6.9118; ppl_turbo4_kv_qwen3_14b_broken: 32643; ppl_turbo4v_qwen3_14b: 6.6232 | 1mo ago |
| EXP-0027 | parallel_blocks tuning for turbo decode | neutral | decode_pb_auto: 29.95; decode_pb1: 29.97; decode_pb2: 29.95; decode_pb4: 29.95; decode_pb8: 29.96; decode_pb16: 29.96; decode_pb32: 29.93; decode_q8_baseline: 30.81 | 1mo ago |
| EXP-0028 | CAT alignment correction analysis | negative | theoretical_gain: 0 | 1mo ago |
| EXP-0029 | Multi-sequence (n_seq > 1) dequant fix | success | ppl_nseq1_before: 6.31; ppl_nseq1_after: 6.31; ppl_nseq2_before: 17.1; ppl_nseq2_after: 6.3; ppl_nseq4_before: 22.56; ppl_nseq4_after: 6.34 | 1mo ago |
| EXP-0030 | turbo4 prefill MMA (fp16 dequant tradeoff) | success | prefill_pp4096_before: 588; prefill_pp4096_after: 1113; ppl_full_precision: 5.8186; ppl_fp16_prefill: 5.8966; decode_tg64: 29.66 | 1mo ago |
| EXP-0031 | Speculative decoding with turbo KV | neutral | throughput_q8_draft: 28.78; throughput_turbo3_draft: 28.85; n_drafted_q8: 1864; n_drafted_turbo3: 1936; normal_decode: 31 | 1mo ago |
| EXP-0032 | Sign+magnitude encoding for turbo3 dequant | neutral | decode_tg64_4k: 30.05; decode_tg64_32k: 29.91; decode_tg64_4k_baseline: 30.04; decode_tg64_32k_baseline: 29.83; ppl: 5.8501 | 1mo ago |
| EXP-0033 | Long-context PPL validation | success | ppl_2k_8c_la1: 5.769; ppl_2k_8c_q8: 5.8375; ppl_4k_4c_la1: 6.3198; ppl_4k_4c_q8: 6.2677; ppl_8k_4c_la1: 7.3952; ppl_8k_4c_q8: 7.4241; ppl_8k_4c_uniform: 7.3783 | 1mo ago |
| EXP-0034 | SQuat query-orthogonal codebook (dropped) | neutral | | 1mo ago |
| EXP-0001 | turbo3 baseline quality (head_dim=256) | success | ppl: 5.8323; ppl_q8_baseline: 5.8375; compression_ratio: 4.9 | 1mo ago |
| EXP-0003 | Layer-adaptive turbo3 (LA-1, first4+last4 q8_0) | success | ppl: 5.769; prefill_pp4096: 1128; decode_tg64: 30.25; compression_ratio: 3.5 | 1mo ago |
| EXP-0005 | FWHT rotation ablation | success | ppl_with_rotation: 5.8323; ppl_without_rotation_with_norm: 6.2357; ppl_without_rotation_without_norm: 6.5249; ppl_q8_baseline: 5.8375 | 1mo ago |
| EXP-0006 | Prefill dequant+MMA optimization | success | prefill_pp4096_before: 631; prefill_pp4096_after: 1121; prefill_ratio_vs_q8: 0.988; decode_tg64: 30.1; ppl: 5.8501 | 1mo ago |
| EXP-0013 | 128K context on 24GB GPU | success | prefill_pp131072: 669; decode_tg64_128k: 29.85; vram_gb: 23.5; q8_0_fits: false | 1mo ago |
| EXP-0020 | Layer-adaptive mode 2 (last 8 layers q8_0) | success | ppl_la2_turbo3: 5.814; ppl_la2_turbo4: 5.8077; decode_tg64_la2_turbo3: 29.98; decode_tg64_la2_turbo4: 29.69; compression_ratio: 3.5 | 1mo ago |
| EXP-0021 | Layer-adaptive modes 3, 4, 5 isolation tests | success | ppl_la3_last4: 5.8091; ppl_la4_first4: 5.8211; ppl_la5_first2_last2: 5.8091; compression_ratio_4layers: 4.2 | 1mo ago |
| EXP-0022 | Asymmetric K/V type combinations | unknown | ppl_turbo4k_q8v: 5.8451; ppl_q8k_turbo3v: 5.8451; ppl_turbo4k_turbo3v: 5.8653; ppl_turbo3k_turbo4v: 5.8212; decode_q8k_turbo3v: 30.32; decode_turbo4k_q8v: 30.15 | 1mo ago |
| EXP-0023 | Layer-adaptive + asymmetric combined | negative | ppl_mode6_vonly_last8: 5.839; ppl_mode7_konly_last8: 5.839; ppl_mode8_vonly_2plus2: 5.833; ppl_mode2_both_last8: 5.814 | 1mo ago |
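
EXP-0037, EXP-0051, and EXP-0054 revolve around trellis-coded quantization: each sample's admissible codebook subset depends on a trellis state, so a greedy per-sample encoder collapses (ppl 17.09 vs 5.83 for Viterbi in EXP-0051). Below is a hypothetical sketch of the Viterbi encode with a global backtrace (cf. EXP-0054), using an illustrative 4-state shift-register trellis and even/odd set partitioning; the project's actual trellis, state count, and partitioning are not specified here.

```python
# Hypothetical TCQ Viterbi encoder: 4-state trellis, codebook split into
# even/odd subsets (Ungerboeck-style set partitioning). Illustrative only.
import numpy as np

NEXT_STATE = [[0, 2], [0, 2], [1, 3], [1, 3]]  # next_state[state][branch_bit]

def viterbi_encode(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Choose one level per sample so the whole path through the trellis
    minimizes total squared error. A greedy per-sample choice cannot do
    this because the branch bit constrains which subsets are reachable
    for all future samples."""
    subsets = (levels[0::2], levels[1::2])       # set partitioning
    n, S = len(x), len(NEXT_STATE)
    cost = np.full(S, np.inf)
    cost[0] = 0.0                                # fixed start state
    back = np.zeros((n, S, 2), dtype=np.int64)   # (prev_state, level_index)
    for t in range(n):
        new_cost = np.full(S, np.inf)
        for s in range(S):
            if not np.isfinite(cost[s]):
                continue
            for b in (0, 1):                     # branch bit selects a subset
                j = int(np.abs(subsets[b] - x[t]).argmin())
                c = cost[s] + (subsets[b][j] - x[t]) ** 2
                ns = NEXT_STATE[s][b]
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, 2 * j + b) # global codebook index
        cost = new_cost
    s = int(cost.argmin())                       # global backtrace
    idx = np.zeros(n, dtype=np.int64)
    for t in range(n - 1, -1, -1):
        s, idx[t] = back[t, s]
    return idx                                   # dequant is levels[idx]
```

Only the encoder searches the trellis; dequantization stays a plain table lookup (levels[idx]), which is why encode strategy shows up in prefill speed (EXP-0051's +8% greedy prefill, EXP-0037's -21%) rather than in decode.
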
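EXP-0035 and EXP-0036 concern per-channel equalization inside the head dimension: head_dim=128 models show imbalanced channel magnitudes that a single shared codebook handles poorly, while head_dim=256 models do not (measured max scale ratio 1.164, so auto-detect leaves them alone). A sketch under assumptions follows; the RMS-based scale definition and the detection threshold are illustrative, not the project's values.

```python
# Illustrative InnerQ-style per-channel equalization (EXP-0035/0036).
# The scale definition and auto-detect threshold are assumptions.
import numpy as np

def innerq_scales(calib: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-channel RMS over a (tokens, head_dim) calibration slice,
    normalized so the mean scale is 1."""
    rms = np.sqrt((calib ** 2).mean(axis=0)) + eps
    return rms / rms.mean()

def auto_detect(calib: np.ndarray, threshold: float = 1.5):
    """Enable equalization only when channels are genuinely imbalanced;
    the 1.5 threshold is a placeholder, not the project's value."""
    s = innerq_scales(calib)
    ratio = float(s.max() / s.min())
    return (s if ratio > threshold else np.ones_like(s)), ratio

# Usage: divide each token vector by the scales before rotation and
# quantization, multiply back after dequant; scales are stored once.
rng = np.random.default_rng(1)
calib = rng.standard_normal((4096, 128)) * np.linspace(0.5, 2.0, 128)
scales, ratio = auto_detect(calib)
token = rng.standard_normal(128)
equalized = token / scales     # quantize this instead of the raw token
restored = equalized * scales  # applied after dequantization
```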