OpenQuant

Open research on LLM quantization. Weight quant, KV cache quant, activation quant — anything sub-fp16. KLD-first quality measurement (PPL secondary, because PPL is easy to game and weakly correlated with downstream quality at low bitrates). Welcomes contributions from any quantization technique: GPTQ-family (GPTQ, GPTAQ, SmoothQuant), AWQ, lattice (E8, D₁₂, Leech, NestQuant), trellis (TCQ, QTIP, PolarQuant), product VQ (AQLM, GPTVQ), finetune-recovery (PV-Tuning, EfficientQAT, RoSTE, NVIDIA QAD), Hadamard rotations (QuaRot, SpinQuant, FWHT). Goal: a shared landscape of what works, what fails, what composes, and what is left to try — across model architectures, bit budgets, and hardware.
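
A minimal sketch of the KLD-first measurement, assuming per-token logits from a reference model and a quantized model are already available for the same tokens. The function names are hypothetical. Using the unquantized model as the reference and the direction KL(reference || quantized) are assumptions, not something this page specifies.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_kld(ref_logits, quant_logits):
    """Mean KL(P_ref || P_quant) in nats per token; inputs are [tokens, vocab]."""
    log_p = log_softmax(np.asarray(ref_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    kld = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kld.mean())

def perplexity(logits, targets):
    """exp(mean negative log-likelihood) of the target token ids."""
    log_p = log_softmax(np.asarray(logits, dtype=np.float64))
    nll = -log_p[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

KLD compares the full output distribution at every position, while PPL only looks at the probability of the one target token, which is one reason the two can disagree.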

quantization weight-quantization kv-cache gptq awq smoothquant tcq trellis lattice-quantization llm compression pareto kld fwht hadamard

Created by @buun on 2026-04-08T16:54:21Z

17 resources tracked

huggingface

Gradient-free fine-tuning specifically for trellis/lattice quantizers — fits TCQ's discrete structure better than STE

https://arxiv.org/abs/2405.16406

wikitext-2-raw-v1 test split is the canonical eval corpus for this project

https://huggingface.co/datasets/wikitext

github

QuIP-sharp — Hadamard-preconditioned high-quality 4-bit weight quant

https://github.com/Cornell-RelaxML/quip-sharp

AWQ reference + activation-aware salient channel scaling

https://github.com/mit-han-lab/llm-awq

SmoothQuant reference implementation. Its default α=0.5 is too strong for FWHT-preconditioned pipelines; we find α≈0.15-0.25 works better

https://github.com/mit-han-lab/smoothquant

Reference implementation for k-quants and the canonical eval (llama-perplexity)

https://github.com/ggml-org/llama.cpp

Original GPTQ reference implementation

https://github.com/IST-DASLab/gptq

paper

Per-group asymmetric scale grid search. Failed standalone in our pipeline (the mean DC bias propagates) but may compose as an inner step in gptq_turbo

https://arxiv.org/abs/2502.13178
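
A rough sketch of what a per-group asymmetric scale grid search can look like. It is not taken from the linked paper or from this project's pipeline: the function names, the shrink-factor grid, and the MSE objective are all illustrative assumptions.

```python
import numpy as np

def fake_quant_group(w, bits, shrink):
    """Asymmetric uniform quantize-dequantize of one group with a shrunken range."""
    levels = 2 ** bits - 1
    lo, hi = w.min() * shrink, w.max() * shrink
    scale = max((hi - lo) / levels, 1e-12)   # guard against constant groups
    zero = np.round(-lo / scale)             # integer zero point
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return (q - zero) * scale

def grid_search_group(w, bits=4, grid=None):
    """Pick the range shrink factor that minimizes reconstruction MSE for one group."""
    w = np.asarray(w, dtype=np.float64)
    grid = np.linspace(0.5, 1.0, 26) if grid is None else np.asarray(grid)
    errors = [np.mean((fake_quant_group(w, bits, f) - w) ** 2) for f in grid]
    best = float(grid[int(np.argmin(errors))])
    return best, fake_quant_group(w, bits, best)
```

To cover a whole weight matrix, one would reshape it to rows of the chosen group size and call grid_search_group on each row. Because the grid includes 1.0, the result is never worse in MSE than plain min-max quantization of the same group.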

Top-k% salient channel scaling. Complementary to SmoothQuant (bulk equalization vs tail protection)

https://arxiv.org/abs/2306.00978

TCQ for weights with QuIP-style incoherence processing. Same family as our KV cache TCQ work

https://arxiv.org/abs/2406.11235

QTIP / TCQ for weights — trellis-coded quantization, much denser effective codebook than scalar at the same nominal bitrate

https://github.com/Cornell-RelaxML/qtip

E8 lattice (and Leech) for weight quant. ~1.42× density gain over scalar Lloyd-Max for white Gaussian sources. Relies on rotation preconditioning to make the input distribution white

https://arxiv.org/abs/2502.09720
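
For orientation, a sketch of nearest-point rounding onto the E8 lattice using the classic Conway-Sloane construction E8 = D8 ∪ (D8 + ½). This is only the textbook decoder, assuming unit lattice scale; it is not the linked paper's implementation, and the function names are hypothetical.

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 (integer vectors with an even coordinate sum)."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding
        # error in the other direction, which is the cheapest fix.
        err = x - r
        k = int(np.argmax(np.abs(err)))
        r[k] += 1.0 if err[k] >= 0 else -1.0
    return r

def nearest_e8(x):
    """Nearest point of E8, taken as the better of the D8 and D8 + 1/2 candidates."""
    x = np.asarray(x, dtype=np.float64)
    c_int = nearest_d8(x)
    c_half = nearest_d8(x - 0.5) + 0.5
    d_int = np.sum((x - c_int) ** 2)
    d_half = np.sum((x - c_half) ** 2)
    return c_int if d_int <= d_half else c_half
```

A weight quantizer would apply this to 8-dimensional blocks of the rotated, rescaled weights, then map the lattice points to a finite codebook; that mapping is out of scope here.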

~20-line patch on top of GPTQ that lowers PPL by ~1-2 points at 4-bit. Reportedly trivial to integrate

https://arxiv.org/abs/2503.19754

Two-phase QAT (block-wise then end-to-end) — recovers most of the PPL gap at sub-4-bit

https://arxiv.org/abs/2411.02355

The Hessian-aware sequential column quantizer that everything else composes with

https://arxiv.org/abs/2210.17323

Per-input-channel rescale s_i = max(|X_i|)^α / max(|W_:i|)^(1-α). The default α=0.5 is wrong post-FWHT — channel equalization needs to be lighter when the rotation is already absorbing per-channel variance

https://arxiv.org/abs/2211.10438
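
The rescale formula above can be sketched as follows, assuming activations X of shape [tokens, in] and weights W of shape [out, in], so that column i of W is W_:i. The function names and the eps guard are illustrative, not the reference implementation's API.

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5, eps=1e-8):
    """s_i = max(|X_i|)^alpha / max(|W_:i|)^(1 - alpha), one scale per input channel."""
    act_max = np.maximum(np.abs(X).max(axis=0), eps)   # [in]
    w_max = np.maximum(np.abs(W).max(axis=0), eps)     # [in]
    return act_max ** alpha / w_max ** (1.0 - alpha)

def apply_smoothing(X, W, s):
    """Divide activations and multiply weights by s; X @ W.T is unchanged."""
    return X / s, W * s
```

Lower α moves less of the activation range into the weights, which is the "lighter" equalization this note argues for after FWHT. Passing alpha=0.2 would sit inside the α≈0.15-0.25 range reported on this page.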

Ordentlich-Polyanskiy. Theoretical justification for Hadamard preconditioning + nested lattice quantization. Phase transition at 0.906 bits

https://arxiv.org/abs/2410.13780
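
Several entries above rely on Hadamard (FWHT) preconditioning. A minimal NumPy sketch of an orthonormal fast Walsh-Hadamard transform follows, assuming the last axis has power-of-two length. The function name is hypothetical, and pipelines that use randomized Hadamard rotations would also apply random sign flips, which are omitted here.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis."""
    x = np.array(x, dtype=np.float64)   # work on a copy
    n = x.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("last axis length must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # Butterfly step: combine adjacent blocks of width h as (a + b, a - b).
        blocks = x.reshape(*lead, n // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)

# A single outlier of magnitude 8 in 64 dimensions is spread evenly:
spike = np.zeros(64)
spike[0] = 8.0
flat = fwht(spike)   # every entry equals 1.0
```

The transform is orthonormal and its own inverse, so applying it to weights and to activations leaves their product unchanged while flattening outlier channels.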