OpenQuant

Open research on LLM quantization. Weight quant, KV cache quant, activation quant — anything sub-fp16. KLD-first quality measurement (PPL secondary, because PPL is easy to game and weakly correlated with downstream quality at low bitrates). Contributions are welcome from any quantization technique family: GPTQ-family (GPTQ, GPTAQ, SmoothQuant), AWQ, lattice (E8, D₁₂, Leech, NestQuant), trellis (TCQ, QTIP, PolarQuant), product VQ (AQLM, GPTVQ), finetune-recovery (PV-Tuning, EfficientQAT, RoSTE, NVIDIA QAD), Hadamard rotations (QuaRot, SpinQuant, FWHT). Goal: a shared landscape of what works, what fails, what composes, and what is left to try — across model architectures, bit budgets, and hardware.

Created by @buun on 2026-04-08T16:54:21Z

Showing 17 experiments

ID Title / Hypothesis Result Confidence Reproductions Metrics
cexp_2067d8
q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — sensitivity should be disproportionate to their parameter count
inconclusive
0.68
1/5
fp16_ppl, q8_ppl, q4_neuqi_ppl
cexp_78f219
down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone
failure
0.68
1/5
k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe
cexp_e48e99
Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL
inconclusive
0.68
1/5
s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl
cexp_fd30cf
Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length
success
0.68
1/5
k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k
cexp_061796
Per-group L2 norm + sign sandwich + FWHT + Lloyd-Max scalar centroids + norm correction (the TurboQuant recipe) ports cleanly from KV cache to weights
success
0.14
1/5
perplexity, bits_per_param, e8_q3_ppl, e8_q3_bpe, scalar_q3_ppl
cexp_194e36
q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — sensitivity should be disproportionate to their parameter count
inconclusive
0.14
1/5
fp16_ppl, q8_ppl, q4_neuqi_ppl
cexp_312e12
Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL
inconclusive
0.14
1/5
s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl
cexp_578517
Per-input-channel rescale s_i = H_ii^alpha (identity-preserving via W<-Ws, H<-H/s/s) should compose with FWHT Gaussianization — channel equalization makes the post-rotation tile distribution closer to white iid Gaussian
success
0.14
1/5
alpha_0_00_ppl, alpha_0_10_ppl, alpha_0_15_ppl, alpha_0_20_ppl, alpha_0_25_ppl, alpha_0_50_ppl, bits_per_param
cexp_5a9aee
Establish reference perplexity for the unquantized model
baseline
0.14
1/5
perplexity, bits_per_param
cexp_5c0bb1
Replacing GPTQ's per-column scalar quantizer with turbo as the inner block quantizer composes well — GPTQ's Hessian-corrected weights pre-align for turbo's rounding, FWHT Gaussianization makes the Lloyd-Max grid usable on weights it normally clips
success
0.14
1/5
perplexity, bits_per_param, act_order_on_ppl, act_order_off_ppl
cexp_5d5420
down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone
failure
0.14
1/5
k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe
cexp_78e364
Q4_K_M is the strongest production-ready 4-bit k-quant; sub-fp16 methods need to beat this on the Pareto frontier
baseline
0.14
1/5
perplexity, bits_per_param
cexp_7fb37c
Different tensor roles (q/k/v/o/gate/up/down) have different quantization sensitivity; the per-bpe ROI ranking should guide where to spend bits
success
0.14
1/5
down_proj_recovery_ppl, up_proj_recovery_ppl, q_proj_recovery_ppl, k_proj_recovery_ppl, o_proj_recovery_ppl, gate_proj_recovery_ppl, v_proj_recovery_ppl
cexp_8f394c
Q8_0 should be near-lossless and is the standard "high quality" reference
baseline
0.14
1/5
perplexity, bits_per_param
cexp_a8c4a0
Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length
success
0.14
1/5
k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k
cexp_d6efe0
gs=128 was a local minimum inherited from KV cache work; weights need a different group_size sweet spot
success
0.14
1/5
perplexity, bits_per_param
cexp_f1156e
Protecting k_proj at Q8_0 (instead of fp16) cuts the bpe overhead 3× while preserving the recovery, because GPTQ Hessians see the actual Q8 values that will run at inference (system is internally self-consistent)
success
0.14
1/5
perplexity, bits_per_param
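
The inner quantizer tested in cexp_061796 and reused as GPTQ's block quantizer in cexp_5c0bb1 (per-group L2 norm, sign sandwich, FWHT, Lloyd-Max scalar centroids, norm correction) is compact enough to sketch. The following is a minimal illustration, not the repo's implementation: the group size, the placement of the ±1 sign diagonal, and the placeholder centroid grid are all assumptions.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dim
    (length must be a power of two); fwht(fwht(v)) == v."""
    x = x.clone()
    n = x.shape[-1]
    h = 1
    while h < n:
        y = x.view(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :].clone(), y[:, :, 1, :].clone()
        y[:, :, 0, :] = a + b
        y[:, :, 1, :] = a - b
        h *= 2
    return x / n ** 0.5

def turbo_quant_group(w_group, centroids, signs):
    """One group: L2-normalize, apply a +/-1 diagonal ("sign sandwich"),
    FWHT-Gaussianize, round each coefficient to the nearest Lloyd-Max
    centroid, invert the transform, then rescale so the dequantized group
    keeps the original L2 norm ("norm correction")."""
    norm = w_group.norm() + 1e-12
    z = fwht(signs * (w_group / norm))
    idx = (z.unsqueeze(-1) - centroids).abs().argmin(dim=-1)
    w_hat = signs * fwht(centroids[idx])          # orthonormal FWHT inverts itself
    return w_hat / (w_hat.norm() + 1e-12) * norm

# Placeholder 3-bit Lloyd-Max stand-in: 8 standard-normal quantile centroids.
# The real codebook would be Lloyd-Max-fit on calibration data.
q = (torch.arange(8, dtype=torch.float32) + 0.5) / 8
centroids = torch.erfinv(2 * q - 1) * 2 ** 0.5

g = 256                                           # assumed group size
w = torch.randn(4096)
signs = (torch.randint(0, 2, (g,)) * 2 - 1).float()
w_hat = torch.cat([turbo_quant_group(w[i:i + g], centroids, signs)
                   for i in range(0, w.numel(), g)])
```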

Proposed Experiments

All current results are on the Qwen3 architecture (Qwen3-0.6B). The recipe may interact differently with Llama (RMSNorm but no q_norm/k_norm), Mistral (sliding window), Phi (different layer scaling), Gemma (post-attention LN), and DeepSeek (MLA / shared experts). At minimum, one result each on Llama / Mistral / Gemma is needed to claim generality
EXP-0014, scope expansion
models: ['llama-3.2-3b', 'mistral-7b-v0.3', 'phi-3.5-mini', 'gemma-2-2b'] quant: gptq_turbo_e8_q4_a0.15 eval: wikitext-2
buun via buun-openquant
PPL-based winners (EXP-0012/13/14) should also win on mean KL divergence vs an fp16 reference. If they don't, the PPL wins are gaming the corpus rather than improving distributional fit
project memory feedback_kld_over_ppl_values.md
kld_base: f16 eval_chunks: 146 methods: ['gptq_turbo_q4_a0.15', 'gptq_turbo_e8_q4_a0.15', 'gptq_turbo_e8_q3_a0.25']
buun via buun-openquant
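A minimal sketch of the KLD harness this proposal asks for, assuming HF-style causal LMs that return `.logits` for a batch of token IDs; the repo's actual chunking (eval_chunks: 146) and reference handling may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(ref_model, quant_model, token_chunks):
    """Mean per-token KL(P_ref || P_quant): the distributional-fit metric to
    report alongside PPL. `token_chunks` is an iterable of [1, seq] LongTensors;
    both models are assumed to return logits of shape [batch, seq, vocab]."""
    total, count = 0.0, 0
    for ids in token_chunks:
        log_p = F.log_softmax(ref_model(ids).logits.float(), dim=-1)
        log_q = F.log_softmax(quant_model(ids).logits.float(), dim=-1)
        kld = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # per-position KL
        total += kld.sum().item()
        count += kld.numel()
    return total / count
```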
The SmoothQuant-α + E8 lattice gain mechanism should replicate on a much larger model, but the absolute PPL gap may shrink (larger models have more redundancy → quant noise has more places to hide). Need to confirm the mechanism is model-size-independent before publishing
EXP-0014
model: qwen3.5-27b methods: ['gptq_turbo_e8_q4_a0.15', 'gptq_turbo_e8_q3_a0.25'] eval_seq_len: 2048 gpu: rented-multi-3090-or-4090
buun via buun-openquant
TCQ (trellis-coded quantization, à la QTIP) gives a much denser effective codebook than scalar Lloyd-Max at the same nominal bitrate by exploiting Viterbi state. Already validated for KV cache (separate fork). Should compose with gptq_turbo + SmoothQuant the same way E8 does, with a larger gain because TCQ density gain > E8 density gain
arXiv:2406.11235 (QTIP), EXP-0014
quant: gptq_turbo_tcq_q3 group_size: 256 trellis_K: 256 smooth_alpha: 0.25 eval_seq_len: 2048
buun via buun-openquant
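For intuition on the mechanism, a toy Viterbi-searched trellis quantizer is sketched below. The 4-state trellis, subset labeling, and uniform placeholder codebook are illustrative assumptions and are not tuned to realize QTIP-scale gains; the point is only to show how the encoder spends 1 trellis bit plus 1 within-subset bit per sample while drawing from an 8-level grid over time.

```python
import torch

def tcq_encode(x, codebook):
    """Toy trellis-coded quantizer: 4 states, 2 branches per state, each branch
    addressing a 2-element subset of an 8-level codebook; Viterbi picks the
    cheapest path. Nominal rate is 2 bits/sample even though the union of
    reachable codewords is the full 8-level grid."""
    n_states = 4
    next_state = [[0, 2], [0, 2], [1, 3], [1, 3]]    # next = (s >> 1) | (b << 1)
    subsets = [codebook[j::4] for j in range(4)]     # D_j = {c_j, c_{j+4}}
    cost = [0.0] + [float("inf")] * (n_states - 1)   # start in state 0
    back = []
    for t in range(x.numel()):
        new_cost = [float("inf")] * n_states
        step = [None] * n_states
        for s in range(n_states):
            if cost[s] == float("inf"):
                continue
            for b in (0, 1):
                sub = subsets[2 * b + (s & 1)]       # branch (s, b) -> subset
                j = int((sub - x[t]).abs().argmin())
                c = cost[s] + float((sub[j] - x[t]) ** 2)
                ns = next_state[s][b]
                if c < new_cost[ns]:
                    new_cost[ns], step[ns] = c, (s, float(sub[j]))
        cost, back = new_cost, back + [step]
    s = min(range(n_states), key=lambda i: cost[i])  # cheapest terminal state
    out = []
    for step in reversed(back):
        s, c = step[s]
        out.append(c)
    return torch.tensor(out[::-1])

x = torch.randn(2048)
codebook = torch.linspace(-2.5, 2.5, 8)              # placeholder; Lloyd-Max-fit in practice
print(float((x - tcq_encode(x, codebook)).pow(2).mean()))
```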
The 3-bit α winner is currently α=0.25 (EXP-0013). The parabola minimum may be in [0.20, 0.30]; tighten with α=0.20 to confirm 0.25 is the local minimum and not a coarse-grid artifact
EXP-0013
quant: gptq_turbo_e8_q3 group_size: 256 smooth_alpha: 0.2 calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
The 4-bit E8 winner is α=0.15 by reusing the scalar 4-bit minimum, but E8's lattice gain may shift the parabola — bracket at 0.10 and 0.20 to confirm
EXP-0014
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha_grid: [0.1, 0.2] calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
NeUQI's per-group asymmetric scale grid search (scale_grid = linspace(0.5, 1.0, 100) × base_scale) failed standalone (mean DC bias propagates as systematic activation offset), but as the inner per-tile step inside gptq_turbo it may compose with FWHT preconditioning the same way E8 does
arXiv:2502.13178 (NeUQI), EXP-0014 mechanism
quant: gptq_turbo_neuqi_q4 group_size: 256 scale_grid_size: 100 calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
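A sketch of the inner per-group step as described here, with the zero-point handling as an assumption (the paper's exact grid and objective may differ).

```python
import torch

def neuqi_style_group_quant(w_group, bits=4, grid_size=100):
    """Per-group asymmetric scale search in the NeUQI spirit: try 100 shrunken
    versions of the min/max base scale and keep the one with the lowest
    reconstruction error for this group."""
    levels = 2 ** bits - 1
    base_scale = (w_group.max() - w_group.min()) / levels + 1e-12
    best_err, best = float("inf"), None
    for f in torch.linspace(0.5, 1.0, grid_size):
        scale = f * base_scale
        zero = torch.round(-w_group.min() / scale).clamp(0, levels)   # asymmetric zero-point
        q = torch.clamp(torch.round(w_group / scale) + zero, 0, levels)
        deq = (q - zero) * scale
        err = float((w_group - deq).pow(2).sum())
        if err < best_err:
            best_err, best = err, deq
    return best

w_hat = neuqi_style_group_quant(torch.randn(256))   # group_size 256, as in the config above
```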
GPTAQ (asymmetric calibration variant of GPTQ) is reported as a ~20-line patch on top of GPTQ that improves quantized PPL by passing already-quantized upstream activations to downstream Hessian capture. Verify the patch lands cleanly and gives the claimed -1 to -2 PPL at 4-bit
arXiv:2503.19754
quant: gptq_turbo_q4 group_size: 256 calib_path: gptaq_asymmetric eval_seq_len: 2048
buun via buun-openquant
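A sketch of the calibration flow this proposal describes (already-quantized upstream activations feeding downstream Hessian capture); `layers` and `quantize_layer` are placeholders, not the repo's API.

```python
import torch

@torch.no_grad()
def sequential_asymmetric_calibration(layers, quantize_layer, calib_acts):
    """Quantize layers in order, rebuilding each layer's calibration statistics
    (Hessian) from activations that have already passed through the quantized
    upstream layers, so calibration matches what inference will actually see."""
    for i, layer in enumerate(layers):
        layers[i] = quantize_layer(layer, calib_acts)   # Hessian from quantized-upstream acts
        calib_acts = layers[i](calib_acts)              # propagate through the quantized layer
    return layers
```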
AWQ identifies the top-k% salient channels by activation magnitude and protects them with per-channel scaling. SmoothQuant equalizes ALL channels by H_ii^α. The two are complementary — SmoothQuant for the bulk, AWQ-style top-k for the high-impact tail
arXiv:2306.00978 (AWQ), EXP-0012
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 awq_top_k_pct: [0.5, 1.0, 2.0] awq_scale: 2.0 eval_seq_len: 2048
buun via buun-openquant
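One way the two rescales might compose, as a sketch: a SmoothQuant-style bulk factor for every channel plus a multiplicative boost for the AWQ-selected tail. The multiplicative composition, the clamp, and the fixed boost are assumptions, not a measured recipe.

```python
import torch

def build_channel_scales(H_diag, act_mag, alpha=0.15, top_k_pct=1.0, awq_scale=2.0):
    """Per-input-channel scale: s_i = H_ii^alpha for the bulk, then boost the
    top-k% channels ranked by mean activation magnitude (AWQ-style saliency)."""
    s = H_diag.clamp_min(1e-8).pow(alpha)              # SmoothQuant bulk equalization
    k = max(1, int(act_mag.numel() * top_k_pct / 100))
    salient = act_mag.topk(k).indices                  # AWQ-style salient channels
    s[salient] = s[salient] * awq_scale
    return s

def apply_identity_preserving_rescale(W, H, s):
    """Identity-preserving form from cexp_578517: W <- W * s per input channel,
    H <- H / (s s^T), so the layer function is unchanged before quantization."""
    return W * s, H / s.unsqueeze(0) / s.unsqueeze(1)
```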
Different tensor roles have different per-channel variance distributions (q/k/v see different upstream activations than gate/up/down). A single global α may be suboptimal — per-role α should improve every role independently
EXP-0012, EXP-0007
quant: gptq_turbo_q4 group_size: 256 alpha_per_role: {'q_proj': 'grid', 'k_proj': 'grid', 'v_proj': 'grid', 'o_proj': 'grid', 'gate_proj': 'grid', 'up_proj': 'grid', 'down_proj': 'grid'} alpha_grid: [0.0, 0.1, 0.15, 0.2, 0.25] eval_seq_len: 2048
buun via buun-openquant
Boundary protection (first/last 2 transformer blocks at Q8) was negative without SmoothQuant (recovery -0.084 PPL within stderr at +0.572 bpe overhead). With SmoothQuant in the recipe, the inner method's residual error pattern changes — boundary may now matter
EXP-0012, project memory project_thetom_tq4_1s_investigation.md
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 boundary_protect: ['first_2', 'last_2'] boundary_method: scalar_per_group_q8 eval_seq_len: 2048
buun via buun-openquant
Calibrating on wikitext.train and evaluating on wikitext.test may have residual domain leakage. Re-running calibration on C4 (general web) and the-stack-python (code) should give similar PPL; if not, the wikitext-train calibration is overfitting the eval domain
EXP-0017
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 calib_corpus_grid: ['wikitext_train', 'c4', 'the_stack_python'] eval_dataset: wikitext_test eval_seq_len: 2048
buun via buun-openquant
After quantization, the per-layer residual error pattern can be fit by a small per-layer post-quant scaling factor (analogous to the V alpha in TCQ KV cache). One scalar per layer, fit to minimize per-layer reconstruction loss
KV cache TCQ V alpha
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha: 0.15 post_quant_alpha: per_layer_fit eval_seq_len: 2048
buun via buun-openquant
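The least-squares fit has a closed form per layer: minimizing ||alpha * W_q X - W X||_F^2 over the scalar alpha gives alpha = <W_q X, W X> / ||W_q X||^2. A sketch, assuming the fit is against calibration activations X rather than against the weights directly.

```python
import torch

@torch.no_grad()
def fit_post_quant_alpha(W_fp, W_q, X):
    """One scalar per layer: closed-form least-squares alpha that best rescales
    the quantized layer output back onto the full-precision output on
    calibration activations X (shape [tokens, in_features])."""
    Y = X @ W_fp.t()      # full-precision layer output
    Y_q = X @ W_q.t()     # quantized layer output
    return (Y_q * Y).sum() / Y_q.pow(2).sum().clamp_min(1e-12)
```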
The current `_e8_aux` helper uses linear-spaced auxiliary centroids derived from `(sorted_c[-1] - sorted_c[0]) / (sorted_c.numel() - 1)`, which is wrong for Lloyd-Max-trained codebooks (they're non-uniformly spaced). Switching to `sorted_c.diff().median()` should match the actual centroid density and give a marginal improvement
code review of _e8_aux
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha: 0.15 aux_centroid_spacing: median_diff eval_seq_len: 2048
buun via buun-openquant
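A sketch of the proposed change, showing only the spacing rule (how `_e8_aux` places the auxiliary centroids around the existing ones is not reproduced here).

```python
import torch

def aux_centroid_spacing(sorted_c: torch.Tensor, mode: str = "median_diff") -> torch.Tensor:
    """Spacing for extending a sorted 1-D centroid set with auxiliary points.
    'linspace' reproduces the current behaviour (mean end-to-end spacing, which
    misstates the local density of a non-uniform Lloyd-Max codebook);
    'median_diff' uses the median gap between adjacent centroids instead."""
    if mode == "linspace":
        return (sorted_c[-1] - sorted_c[0]) / (sorted_c.numel() - 1)
    return sorted_c.diff().median()

# The two rules only agree on uniformly spaced codebooks:
c = torch.erfinv(torch.linspace(-0.95, 0.95, 16)) * 2 ** 0.5   # non-uniform, Lloyd-Max-like
print(aux_centroid_spacing(c, "linspace"), aux_centroid_spacing(c, "median_diff"))
```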