OpenQuant

Open research on LLM quantization. Weight quant, KV cache quant, activation quant — anything sub-fp16. KLD-first quality measurement (PPL secondary, because PPL is easy to game and weakly correlated with downstream quality at low bitrates). Contributions are welcome from any quantization technique family: GPTQ-family (GPTQ, GPTAQ, SmoothQuant), AWQ, lattice (E8, D₁₂, Leech, NestQuant), trellis (TCQ, QTIP, PolarQuant), product VQ (AQLM, GPTVQ), finetune-recovery (PV-Tuning, EfficientQAT, RoSTE, NVIDIA QAD), Hadamard rotations (QuaRot, SpinQuant, FWHT). Goal: a shared landscape of what works, what fails, what composes, and what is left to try — across model architectures, bit budgets, and hardware.

Created by @buun on 2026-04-08T16:54:21Z

Showing 17 experiments

ID Title / Hypothesis Result Confidence Reproductions Metrics
cexp_2067d8
q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — sensitivity should be disproportionate to their parameter count
inconclusive
0.68
1/5
fp16_ppl, q8_ppl, q4_neuqi_ppl
cexp_78f219
down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone
failure
0.68
1/5
k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe
cexp_e48e99
Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL
inconclusive
0.68
1/5
s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl
cexp_fd30cf
Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length
success
0.68
1/5
k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k
cexp_061796
Per-group L2 norm + sign sandwich + FWHT + Lloyd-Max scalar centroids + norm correction (the TurboQuant recipe) ports cleanly from KV cache to weights
success
0.14
1/5
perplexity, bits_per_param, e8_q3_ppl, e8_q3_bpe, scalar_q3_ppl
cexp_194e36
q_norm/k_norm RMSNorm tensors are tiny but sit in the attention path — sensitivity should be disproportionate to their parameter count
inconclusive
0.14
1/5
fp16_ppl, q8_ppl, q4_neuqi_ppl
cexp_312e12
Larger calibration sample count and sequence length should give a better Hessian estimate and improve quantized PPL
inconclusive
0.14
1/5
s32_l2k_ppl, s64_l4k_ppl, s128_l8k_ppl
cexp_578517
Per-input-channel rescale s_i = H_ii^alpha (identity-preserving via W<-Ws, H<-H/s/s) should compose with FWHT Gaussianization — channel equalization makes the post-rotation tile distribution closer to white iid Gaussian
success
0.14
1/5
alpha_0_00_ppl, alpha_0_10_ppl, alpha_0_15_ppl, alpha_0_20_ppl, alpha_0_25_ppl, alpha_0_50_ppl, bits_per_param
cexp_5a9aee
Establish reference perplexity for the unquantized model
baseline
0.14
1/5
perplexity, bits_per_param
cexp_5c0bb1
Replacing GPTQ's per-column scalar quantizer with turbo as the inner block quantizer composes well — GPTQ's Hessian-corrected weights pre-align for turbo's rounding, FWHT Gaussianization makes the Lloyd-Max grid usable on weights it normally clips
success
0.14
1/5
perplexity, bits_per_param, act_order_on_ppl, act_order_off_ppl
cexp_5d5420
down_proj is the most quant-sensitive role (38% of total error budget per EXP-0007); stacking it with k_proj protection should give strictly more recovery than k_proj alone
failure
0.14
1/5
k_only_ppl, k_plus_down_q8_ppl, k_plus_down_q8_bpe
cexp_78e364
Q4_K_M is the strongest production-ready 4-bit k-quant; sub-fp16 methods need to beat this on the Pareto frontier
baseline
0.14
1/5
perplexity, bits_per_param
cexp_7fb37c
Different tensor roles (q/k/v/o/gate/up/down) have different quantization sensitivity; the per-bpe ROI ranking should guide where to spend bits
success
0.14
1/5
down_proj_recovery_ppl, up_proj_recovery_ppl, q_proj_recovery_ppl, k_proj_recovery_ppl, o_proj_recovery_ppl, gate_proj_recovery_ppl, v_proj_recovery_ppl
cexp_8f394c
Q8_0 should be near-lossless and is the standard "high quality" reference
baseline
0.14
1/5
perplexity, bits_per_param
cexp_a8c4a0
Softmax amplifies K-side errors more than V-side errors; the gap should grow with context length
success
0.14
1/5
k_proj_roi_2k, k_proj_roi_16k, kv_ratio_2k, kv_ratio_16k
cexp_d6efe0
gs=128 was a local minimum inherited from KV cache work; weights need a different group_size sweet spot
success
0.14
1/5
perplexity, bits_per_param
cexp_f1156e
Protecting k_proj at Q8_0 (instead of fp16) cuts the bpe overhead 3× while preserving the recovery, because GPTQ Hessians see the actual Q8 values that will run at inference (system is internally self-consistent)
success
0.14
1/5
perplexity, bits_per_param
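
The inner quantizer tested in cexp_061796 and reused as GPTQ's block quantizer in cexp_5c0bb1 (per-group L2 norm, sign sandwich, FWHT, Lloyd-Max scalar centroids, norm correction) is compact enough to sketch. The following is a minimal illustration, not the repo's implementation: the group size, the placement of the ±1 sign diagonal, and the placeholder centroid grid are all assumptions.

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dim
    (length must be a power of two); fwht(fwht(v)) == v."""
    x = x.clone()
    n = x.shape[-1]
    h = 1
    while h < n:
        y = x.view(-1, n // (2 * h), 2, h)
        a, b = y[:, :, 0, :].clone(), y[:, :, 1, :].clone()
        y[:, :, 0, :] = a + b
        y[:, :, 1, :] = a - b
        h *= 2
    return x / n ** 0.5

def turbo_quant_group(w_group, centroids, signs):
    """One group: L2-normalize, apply a +/-1 diagonal ("sign sandwich"),
    FWHT-Gaussianize, round each coefficient to the nearest Lloyd-Max
    centroid, invert the transform, then rescale so the dequantized group
    keeps the original L2 norm ("norm correction")."""
    norm = w_group.norm() + 1e-12
    z = fwht(signs * (w_group / norm))
    idx = (z.unsqueeze(-1) - centroids).abs().argmin(dim=-1)
    w_hat = signs * fwht(centroids[idx])          # orthonormal FWHT inverts itself
    return w_hat / (w_hat.norm() + 1e-12) * norm

# Placeholder 3-bit Lloyd-Max stand-in: 8 standard-normal quantile centroids.
# The real codebook would be Lloyd-Max-fit on calibration data.
q = (torch.arange(8, dtype=torch.float32) + 0.5) / 8
centroids = torch.erfinv(2 * q - 1) * 2 ** 0.5

g = 256                                           # assumed group size
w = torch.randn(4096)
signs = (torch.randint(0, 2, (g,)) * 2 - 1).float()
w_hat = torch.cat([turbo_quant_group(w[i:i + g], centroids, signs)
                   for i in range(0, w.numel(), g)])
```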

Proposed Experiments

All current results are on the Qwen3 architecture (Qwen3-0.6B). The recipe may interact differently with Llama (RMSNorm but no q_norm/k_norm), Mistral (sliding window), Phi (different layer scaling), Gemma (post-attention LN), and DeepSeek (MLA / shared experts). At minimum, one result each on Llama / Mistral / Gemma is needed to claim generality
EXP-0014, scope expansion
models: ['llama-3.2-3b', 'mistral-7b-v0.3', 'phi-3.5-mini', 'gemma-2-2b'] quant: gptq_turbo_e8_q4_a0.15 eval: wikitext-2
buun via buun-openquant
PPL-based winners (EXP-0012/13/14) should also win on mean KL divergence vs an fp16 reference. If they don't, the PPL wins are gaming the corpus rather than improving distributional fit
project memory feedback_kld_over_ppl_values.md
kld_base: f16 eval_chunks: 146 methods: ['gptq_turbo_q4_a0.15', 'gptq_turbo_e8_q4_a0.15', 'gptq_turbo_e8_q3_a0.25']
buun via buun-openquant
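A minimal sketch of the KLD harness this proposal asks for, assuming HF-style causal LMs that return `.logits` for a batch of token IDs; the repo's actual chunking (eval_chunks: 146) and reference handling may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kld(ref_model, quant_model, token_chunks):
    """Mean per-token KL(P_ref || P_quant): the distributional-fit metric to
    report alongside PPL. `token_chunks` is an iterable of [1, seq] LongTensors;
    both models are assumed to return logits of shape [batch, seq, vocab]."""
    total, count = 0.0, 0
    for ids in token_chunks:
        log_p = F.log_softmax(ref_model(ids).logits.float(), dim=-1)
        log_q = F.log_softmax(quant_model(ids).logits.float(), dim=-1)
        kld = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # per-position KL
        total += kld.sum().item()
        count += kld.numel()
    return total / count
```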
The SmoothQuant-α + E8 lattice gain mechanism should replicate on a much larger model, but the absolute PPL gap may shrink (larger models have more redundancy → quant noise has more places to hide). Need to confirm the mechanism is model-size-independent before publishing
EXP-0014
model: qwen3.5-27b methods: ['gptq_turbo_e8_q4_a0.15', 'gptq_turbo_e8_q3_a0.25'] eval_seq_len: 2048 gpu: rented-multi-3090-or-4090
buun via buun-openquant
TCQ (trellis-coded quantization, à la QTIP) gives a much denser effective codebook than scalar Lloyd-Max at the same nominal bitrate by exploiting Viterbi state. Already validated for KV cache (separate fork). Should compose with gptq_turbo + SmoothQuant the same way E8 does, with a larger gain because TCQ density gain > E8 density gain
arXiv:2406.11235 (QTIP), EXP-0014
quant: gptq_turbo_tcq_q3 group_size: 256 trellis_K: 256 smooth_alpha: 0.25 eval_seq_len: 2048
buun via buun-openquant
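For intuition on the mechanism, a toy Viterbi-searched trellis quantizer is sketched below. The 4-state trellis, subset labeling, and uniform placeholder codebook are illustrative assumptions and are not tuned to realize QTIP-scale gains; the point is only to show how the encoder spends 1 trellis bit plus 1 within-subset bit per sample while drawing from an 8-level grid over time.

```python
import torch

def tcq_encode(x, codebook):
    """Toy trellis-coded quantizer: 4 states, 2 branches per state, each branch
    addressing a 2-element subset of an 8-level codebook; Viterbi picks the
    cheapest path. Nominal rate is 2 bits/sample even though the union of
    reachable codewords is the full 8-level grid."""
    n_states = 4
    next_state = [[0, 2], [0, 2], [1, 3], [1, 3]]    # next = (s >> 1) | (b << 1)
    subsets = [codebook[j::4] for j in range(4)]     # D_j = {c_j, c_{j+4}}
    cost = [0.0] + [float("inf")] * (n_states - 1)   # start in state 0
    back = []
    for t in range(x.numel()):
        new_cost = [float("inf")] * n_states
        step = [None] * n_states
        for s in range(n_states):
            if cost[s] == float("inf"):
                continue
            for b in (0, 1):
                sub = subsets[2 * b + (s & 1)]       # branch (s, b) -> subset
                j = int((sub - x[t]).abs().argmin())
                c = cost[s] + float((sub[j] - x[t]) ** 2)
                ns = next_state[s][b]
                if c < new_cost[ns]:
                    new_cost[ns], step[ns] = c, (s, float(sub[j]))
        cost, back = new_cost, back + [step]
    s = min(range(n_states), key=lambda i: cost[i])  # cheapest terminal state
    out = []
    for step in reversed(back):
        s, c = step[s]
        out.append(c)
    return torch.tensor(out[::-1])

x = torch.randn(2048)
codebook = torch.linspace(-2.5, 2.5, 8)              # placeholder; Lloyd-Max-fit in practice
print(float((x - tcq_encode(x, codebook)).pow(2).mean()))
```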
The 3-bit α winner is currently α=0.25 (EXP-0013). The parabola minimum may be in [0.20, 0.30]; tighten with α=0.20 to confirm 0.25 is the local minimum and not a coarse-grid artifact
EXP-0013
quant: gptq_turbo_e8_q3 group_size: 256 smooth_alpha: 0.2 calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
The 4-bit E8 winner is α=0.15 by reusing the scalar 4-bit minimum, but E8's lattice gain may shift the parabola — bracket at 0.10 and 0.20 to confirm
EXP-0014
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha_grid: [0.1, 0.2] calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
NeUQI's per-group asymmetric scale grid search (scale_grid = linspace(0.5, 1.0, 100) × base_scale) failed standalone (mean DC bias propagates as systematic activation offset), but as the inner per-tile step inside gptq_turbo it may compose with FWHT preconditioning the same way E8 does
arXiv:2502.13178 (NeUQI), EXP-0014 mechanism
quant: gptq_turbo_neuqi_q4 group_size: 256 scale_grid_size: 100 calib_samples: 64 calib_seq_len: 4096 eval_seq_len: 2048
buun via buun-openquant
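A sketch of the inner per-group step as described here, with the zero-point handling as an assumption (the paper's exact grid and objective may differ).

```python
import torch

def neuqi_style_group_quant(w_group, bits=4, grid_size=100):
    """Per-group asymmetric scale search in the NeUQI spirit: try 100 shrunken
    versions of the min/max base scale and keep the one with the lowest
    reconstruction error for this group."""
    levels = 2 ** bits - 1
    base_scale = (w_group.max() - w_group.min()) / levels + 1e-12
    best_err, best = float("inf"), None
    for f in torch.linspace(0.5, 1.0, grid_size):
        scale = f * base_scale
        zero = torch.round(-w_group.min() / scale).clamp(0, levels)   # asymmetric zero-point
        q = torch.clamp(torch.round(w_group / scale) + zero, 0, levels)
        deq = (q - zero) * scale
        err = float((w_group - deq).pow(2).sum())
        if err < best_err:
            best_err, best = err, deq
    return best

w_hat = neuqi_style_group_quant(torch.randn(256))   # group_size 256, as in the config above
```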
GPTAQ (asymmetric calibration variant of GPTQ) is reported as a ~20-line patch on top of GPTQ that improves quantized PPL by passing already-quantized upstream activations to downstream Hessian capture. Verify the patch lands cleanly and gives the claimed -1 to -2 PPL at 4-bit
arXiv:2503.19754
quant: gptq_turbo_q4 group_size: 256 calib_path: gptaq_asymmetric eval_seq_len: 2048
buun via buun-openquant
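A sketch of the calibration flow this proposal describes (already-quantized upstream activations feeding downstream Hessian capture); `layers` and `quantize_layer` are placeholders, not the repo's API.

```python
import torch

@torch.no_grad()
def sequential_asymmetric_calibration(layers, quantize_layer, calib_acts):
    """Quantize layers in order, rebuilding each layer's calibration statistics
    (Hessian) from activations that have already passed through the quantized
    upstream layers, so calibration matches what inference will actually see."""
    for i, layer in enumerate(layers):
        layers[i] = quantize_layer(layer, calib_acts)   # Hessian from quantized-upstream acts
        calib_acts = layers[i](calib_acts)              # propagate through the quantized layer
    return layers
```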
AWQ identifies the top-k% salient channels by activation magnitude and protects them with per-channel scaling. SmoothQuant equalizes ALL channels by H_ii^α. The two are complementary — SmoothQuant for the bulk, AWQ-style top-k for the high-impact tail
arXiv:2306.00978 (AWQ), EXP-0012
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 awq_top_k_pct: [0.5, 1.0, 2.0] awq_scale: 2.0 eval_seq_len: 2048
buun via buun-openquant
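One way the two rescales might compose, as a sketch: a SmoothQuant-style bulk factor for every channel plus a multiplicative boost for the AWQ-selected tail. The multiplicative composition, the clamp, and the fixed boost are assumptions, not a measured recipe.

```python
import torch

def build_channel_scales(H_diag, act_mag, alpha=0.15, top_k_pct=1.0, awq_scale=2.0):
    """Per-input-channel scale: s_i = H_ii^alpha for the bulk, then boost the
    top-k% channels ranked by mean activation magnitude (AWQ-style saliency)."""
    s = H_diag.clamp_min(1e-8).pow(alpha)              # SmoothQuant bulk equalization
    k = max(1, int(act_mag.numel() * top_k_pct / 100))
    salient = act_mag.topk(k).indices                  # AWQ-style salient channels
    s[salient] = s[salient] * awq_scale
    return s

def apply_identity_preserving_rescale(W, H, s):
    """Identity-preserving form from cexp_578517: W <- W * s per input channel,
    H <- H / (s s^T), so the layer function is unchanged before quantization."""
    return W * s, H / s.unsqueeze(0) / s.unsqueeze(1)
```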
Different tensor roles have different per-channel variance distributions (q/k/v see different upstream activations than gate/up/down). A single global α may be suboptimal — per-role α should improve every role independently
EXP-0012, EXP-0007
quant: gptq_turbo_q4 group_size: 256 alpha_per_role: {'q_proj': 'grid', 'k_proj': 'grid', 'v_proj': 'grid', 'o_proj': 'grid', 'gate_proj': 'grid', 'up_proj': 'grid', 'down_proj': 'grid'} alpha_grid: [0.0, 0.1, 0.15, 0.2, 0.25] eval_seq_len: 2048
buun via buun-openquant
Boundary protection (first/last 2 transformer blocks at Q8) was negative without SmoothQuant (recovery -0.084 PPL within stderr at +0.572 bpe overhead). With SmoothQuant in the recipe, the inner method's residual error pattern changes — boundary may now matter
EXP-0012, project memory project_thetom_tq4_1s_investigation.md
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 boundary_protect: ['first_2', 'last_2'] boundary_method: scalar_per_group_q8 eval_seq_len: 2048
buun via buun-openquant
Calibrating on wikitext.train and evaluating on wikitext.test may have residual domain leakage. Re-running calibration on C4 (general web) and the-stack-python (code) should give similar PPL; if not, the wikitext-train calibration is overfitting the eval domain
EXP-0017
quant: gptq_turbo_q4 group_size: 256 smooth_alpha: 0.15 calib_corpus_grid: ['wikitext_train', 'c4', 'the_stack_python'] eval_dataset: wikitext_test eval_seq_len: 2048
buun via buun-openquant
After quantization, the per-layer residual error pattern can be fit by a small per-layer post-quant scaling factor (analogous to the V alpha in TCQ KV cache). One scalar per layer, fit to minimize per-layer reconstruction loss
KV cache TCQ V alpha
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha: 0.15 post_quant_alpha: per_layer_fit eval_seq_len: 2048
buun via buun-openquant
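The least-squares fit has a closed form per layer: minimizing ||alpha * W_q X - W X||_F^2 over the scalar alpha gives alpha = <W_q X, W X> / ||W_q X||^2. A sketch, assuming the fit is against calibration activations X rather than against the weights directly.

```python
import torch

@torch.no_grad()
def fit_post_quant_alpha(W_fp, W_q, X):
    """One scalar per layer: closed-form least-squares alpha that best rescales
    the quantized layer output back onto the full-precision output on
    calibration activations X (shape [tokens, in_features])."""
    Y = X @ W_fp.t()      # full-precision layer output
    Y_q = X @ W_q.t()     # quantized layer output
    return (Y_q * Y).sum() / Y_q.pow(2).sum().clamp_min(1e-12)
```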
The current `_e8_aux` helper uses linear-spaced auxiliary centroids derived from `(sorted_c[-1] - sorted_c[0]) / (sorted_c.numel() - 1)`, which is wrong for Lloyd-Max-trained codebooks (they're non-uniformly spaced). Switching to `sorted_c.diff().median()` should match the actual centroid density and give a marginal improvement
code review of _e8_aux
quant: gptq_turbo_e8_q4 group_size: 256 smooth_alpha: 0.15 aux_centroid_spacing: median_diff eval_seq_len: 2048
buun via buun-openquant
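A sketch of the proposed change, showing only the spacing rule (how `_e8_aux` places the auxiliary centroids around the existing ones is not reproduced here).

```python
import torch

def aux_centroid_spacing(sorted_c: torch.Tensor, mode: str = "median_diff") -> torch.Tensor:
    """Spacing for extending a sorted 1-D centroid set with auxiliary points.
    'linspace' reproduces the current behaviour (mean end-to-end spacing, which
    misstates the local density of a non-uniform Lloyd-Max codebook);
    'median_diff' uses the median gap between adjacent centroids instead."""
    if mode == "linspace":
        return (sorted_c[-1] - sorted_c[0]) / (sorted_c.numel() - 1)
    return sorted_c.diff().median()

# The two rules only agree on uniformly spaced codebooks:
c = torch.erfinv(torch.linspace(-0.95, 0.95, 16)) * 2 ** 0.5   # non-uniform, Lloyd-Max-like
print(aux_centroid_spacing(c, "linspace"), aux_centroid_spacing(c, "median_diff"))
```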