Open research on LLM quantization. Weight quant, KV cache quant, activation quant — anything sub-fp16. KLD-first quality measurement (PPL secondary, because PPL is easy to game and weakly correlated with downstream quality at low bitrates). Welcomes contributions from any quantization technique: GPTQ-family (GPTQ, GPTAQ, SmoothQuant), AWQ, lattice (E8, D₁₂, Leech, NestQuant), trellis (TCQ, QTIP, PolarQuant), product VQ (AQLM, GPTVQ), finetune-recovery (PV-Tuning, EfficientQAT, RoSTE, NVIDIA QAD), Hadamard rotations (QuaRot, SpinQuant, FWHT). Goal: a shared landscape of what works, what fails, what composes, and what is left to try — across model architectures, bit budgets, and hardware.
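Since results here are scored KLD-first, a minimal sketch of that measurement, assuming both the reference and quantized models are HuggingFace-style callables returning `.logits`; the function name and batching are illustrative, not a fixed harness API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(model_ref, model_q, input_ids: torch.Tensor) -> float:
    """Mean per-token KL(p_ref || p_q) over one batch of token ids."""
    logp_ref = F.log_softmax(model_ref(input_ids).logits.float(), dim=-1)
    logp_q = F.log_softmax(model_q(input_ids).logits.float(), dim=-1)
    # F.kl_div(input=log q, target=log p, log_target=True) -> p * (log p - log q)
    pointwise = F.kl_div(logp_q, logp_ref, log_target=True, reduction="none")
    return pointwise.sum(dim=-1).mean().item()  # sum over vocab, mean over positions
```

Summing over the full vocabulary is the point: PPL only looks at the probability of the reference token, while the KLD moves whenever any part of the output distribution drifts.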
17 experiments:
| ID | Title / Hypothesis | Result | Confidence | Reproductions | Metrics |
|---|---|---|---|---|---|
| cexp_2067d8 | q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — their sensitivity should be out of proportion to their parameter count | inconclusive | | 1/5 | fp16_ppl, q8_ppl, q4_neuqi_ppl |
| cexp_78f219 | down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone | failure | | 1/5 | k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe |
| cexp_e48e99 | Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL (see the sketch after the table) | inconclusive | | 1/5 | s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl |
| cexp_fd30cf | Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length (see the probe after the table) | success | | 1/5 | k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k |
| cexp_061796 | Per-group L2 norm + sign sandwich + FWHT + Lloyd-Max scalar centroids + norm correction (the TurboQuant recipe; sketched after the table) ports cleanly from KV cache to weights | success | | 1/5 | perplexity, bits_per_param, e8_q3_ppl, e8_q3_bpe, scalar_q3_ppl |
| cexp_194e36 | q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — their sensitivity should be out of proportion to their parameter count | inconclusive | | 1/5 | fp16_ppl, q8_ppl, q4_neuqi_ppl |
| cexp_312e12 | Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL | inconclusive | | 1/5 | s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl |
| cexp_578517 | Per-input-channel rescale s_i = H_ii^alpha (identity-preserving via W ← W·diag(s), H ← diag(1/s)·H·diag(1/s); sketched after the table) should compose with FWHT Gaussianization — channel equalization makes the post-rotation tile distribution closer to white iid Gaussian | success | | 1/5 | alpha_0_00_ppl, alpha_0_10_ppl, alpha_0_15_ppl, alpha_0_20_ppl, alpha_0_25_ppl, alpha_0_50_ppl, bits_per_param |
| cexp_5a9aee | Establish reference perplexity for the unquantized model | baseline | | 1/5 | perplexity, bits_per_param |
| cexp_5c0bb1 | Replacing GPTQ's per-column scalar quantizer with turbo as the inner block quantizer composes well — GPTQ's Hessian-corrected weights are pre-conditioned for turbo's rounding, and FWHT Gaussianization makes the Lloyd-Max grid usable on weights it would normally clip (sketch after the table) | success | | 1/5 | perplexity, bits_per_param, act_order_on_ppl, act_order_off_ppl |
| cexp_5d5420 | down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone | failure | | 1/5 | k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe |
| cexp_78e364 | Q4_K_M is the strongest production-ready 4-bit k-quant; sub-fp16 methods need to beat this on the Pareto frontier | baseline | | 1/5 | perplexity, bits_per_param |
| cexp_7fb37c | Different tensor roles (q/k/v/o/gate/up/down) have different quantization sensitivity; the per-bpe ROI ranking should guide where to spend bits (sketch after the table) | success | | 1/5 | down_proj_recovery_ppl, up_proj_recovery_ppl, q_proj_recovery_ppl, k_proj_recovery_ppl, o_proj_recovery_ppl, gate_proj_recovery_ppl, v_proj_recovery_ppl |
| cexp_8f394c | Q8_0 should be near-lossless and is the standard "high quality" reference | baseline | | 1/5 | perplexity, bits_per_param |
| cexp_a8c4a0 | Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length | success | | 1/5 | k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k |
| cexp_d6efe0 | gs=128 was a local minimum inherited from KV cache work; weights need a different group_size sweet spot | success | | 1/5 | perplexity, bits_per_param |
| cexp_f1156e | Protecting k_proj at Q8_0 (instead of fp16) cuts the bpe overhead 3× while preserving the recovery, because the GPTQ Hessians see the actual Q8 values that will run at inference — the system is internally self-consistent (Q8_0 sketched after the table) | success | | 1/5 | perplexity, bits_per_param |
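For cexp_e48e99 / cexp_312e12, the quantity being varied is GPTQ's proxy Hessian H = 2 Σ_t x_t x_tᵀ over calibration tokens. A sketch, where `layer_inputs` stands in for whatever hook captures the layer's inputs; the accumulator below is an assumption about the harness, not its actual code:

```python
import numpy as np

def calib_hessian(layer_inputs) -> np.ndarray:
    """H = 2/N * sum_t x_t x_t^T from an iterable of [tokens, in_features] arrays."""
    H, n_tokens = None, 0
    for X in layer_inputs:            # one activation matrix per calibration sample
        X = X.astype(np.float64)      # accumulate in fp64 for a stable estimate
        H = X.T @ X if H is None else H + X.T @ X
        n_tokens += X.shape[0]
    return 2.0 * H / n_tokens         # more samples / longer sequences -> lower variance
```

More samples and longer sequences add outer products and shrink estimator variance; whether that translates into better quantized PPL is exactly what the s32_l2k / s64_l4k / s128_l8k sweep measures.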
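For cexp_fd30cf / cexp_a8c4a0, a toy probe of the claimed asymmetry: inject the same-sized noise into K or into V and compare attention-output error at two context lengths. This is an illustrative stand-in for the experiment's kv_ratio metric, not its harness, and random Gaussian K/V understate how peaked real attention is:

```python
import numpy as np

def attn(q, K, V):
    s = (K @ q) / np.sqrt(q.shape[-1])          # attention scores
    p = np.exp(s - s.max()); p /= p.sum()       # numerically stable softmax
    return p @ V

rng = np.random.default_rng(0)
d = 64
for T in (2048, 16384):                         # the experiment's 2k / 16k contexts
    q = rng.standard_normal(d)
    K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
    noise = 0.05 * rng.standard_normal((T, d))  # identical noise tensor for both sides
    base = attn(q, K, V)
    err_k = np.linalg.norm(attn(q, K + noise, V) - base)  # K-side: moves the softmax
    err_v = np.linalg.norm(attn(q, K, V + noise) - base)  # V-side: averaged by p
    print(f"T={T:6d}  kv_ratio={err_k / err_v:.2f}")
```

K-side noise perturbs the scores before the softmax, so it can reweight (or flip) which positions get attended; V-side noise is only averaged through the attention weights, whose L2 norm shrinks as attention spreads over more tokens.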
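The cexp_061796 recipe, sketched end-to-end on one weight group. The 3-bit Lloyd-Max centroids below are the standard unit-Gaussian values; the group size and the random (rather than learned) sign vector are assumptions, not the experiment's exact choices:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

# 8-level (3-bit) Lloyd-Max codebook for a unit Gaussian.
LM3 = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def turbo_group(w: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """L2 norm + sign sandwich + FWHT + Lloyd-Max + norm correction for one group."""
    n, norm = len(w), np.linalg.norm(w)
    s = rng.choice([-1.0, 1.0], size=n)
    z = fwht(s * (w / norm)) / np.sqrt(n)       # ~ whitened, unit total norm
    q = LM3[np.abs(np.sqrt(n) * z[:, None] - LM3).argmin(axis=1)]  # nearest centroid
    w_hat = s * fwht(q) / n                      # inverse FWHT, undo the sign sandwich
    return w_hat * norm / (np.linalg.norm(w_hat) + 1e-12)  # norm correction
```

This is the scalar-centroid path (the scalar_q3_ppl metric); the e8_q3 variants presumably swap the per-coordinate centroid step for E8 lattice rounding, with the rest of the pipeline unchanged.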
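The cexp_578517 rescale in code. The identity W·x = (W·diag(s))·(diag(1/s)·x) is what makes it free at the model level; the alpha default matches one of the swept values, and the dead-channel guard is an added assumption:

```python
import numpy as np

def channel_equalize(W: np.ndarray, H: np.ndarray, alpha: float = 0.15):
    """W: [out, in], H: [in, in] proxy Hessian. Returns rescaled (W, H) and s."""
    s = np.diag(H) ** alpha                            # s_i = H_ii ** alpha
    s = np.where(np.isfinite(s) & (s > 0), s, 1.0)     # guard dead input channels
    W_eq = W * s[None, :]                              # W <- W * diag(s)
    H_eq = H / np.outer(s, s)                          # H <- diag(1/s) H diag(1/s)
    return W_eq, H_eq, s   # quantize W_eq against H_eq; at inference, divide the
                           # layer input by s (or fold 1/s into the dequant scales)
```

Note that diag(H_eq) = H_ii^(1 - 2·alpha), so alpha = 0.5 fully flattens the Hessian diagonal; the sweep asks how far toward full equalization is optimal once the FWHT rotation is also in play.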
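A simplified view of the cexp_5c0bb1 composition: the GPTQ outer loop with its per-column rounding replaced by a pluggable tile quantizer. No act_order or lazy batching here, and within-block error propagation is deliberately absent because the tile is quantized jointly; `quantize_tile` would be the turbo recipe above applied per group:

```python
import numpy as np

def gptq_block(W, H, quantize_tile, blocksize=128, damp=0.01):
    """W: [out, in], H: [in, in]. Error-feedback quantization with a joint tile step."""
    W = W.astype(np.float64)
    n = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(n)   # standard GPTQ dampening
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T    # upper-triangular factor
    for i in range(0, n, blocksize):
        j = min(i + blocksize, n)
        Q = quantize_tile(W[:, i:j])                 # inner quantizer, e.g. turbo
        err = (W[:, i:j] - Q) / np.diag(Hinv)[i:j]   # column residuals, scaled
        W[:, i:j] = Q
        W[:, j:] -= err @ Hinv[i:j, j:]              # push error onto later columns
    return W
```

The cross-block propagation line is the same update GPTQ applies after finishing a block; the change is only that the block itself is rounded jointly instead of column by column.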
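The ranking logic behind cexp_7fb37c, with `run` as a placeholder for the repo's quantize-and-evaluate harness, assumed to return perplexity and bits per parameter for a config; only the arithmetic is the point:

```python
ROLES = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

def roi_ranking(run, base_cfg):
    """Rank roles by PPL recovered per extra bit-per-param spent protecting them."""
    base_ppl, base_bpp = run(base_cfg)
    rows = []
    for role in ROLES:
        ppl, bpp = run({**base_cfg, "protect": [role]})  # hold one role at high precision
        roi = (base_ppl - ppl) / max(bpp - base_bpp, 1e-9)
        rows.append((role, roi))
    return sorted(rows, key=lambda r: r[1], reverse=True)  # best bit-for-bit payoff first
```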
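Q8_0, the protection level in cexp_f1156e and the cexp_8f394c reference: blocks of 32 values, one fp16 scale each, for 8.5 bits per weight. A minimal round-trip sketch following the llama.cpp block layout:

```python
import numpy as np

def q8_0_roundtrip(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize a 1-D fp array through Q8_0 (32-wide blocks, fp16 scale)."""
    out = np.empty(len(w), dtype=np.float32)
    for i in range(0, len(w), 32):
        block = w[i:i + 32].astype(np.float32)
        amax = np.abs(block).max()
        d = np.float32(np.float16(amax / 127.0)) if amax > 0 else np.float32(1.0)
        q = np.clip(np.round(block / d), -127, 127)   # int8 codes
        out[i:i + 32] = q * d                         # the values inference will see
    return out
```

Feeding these dequantized values, rather than the fp16 originals, into the Hessian pass for later layers is what the hypothesis means by self-consistency: GPTQ corrects against the weights that will actually run.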