Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
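The pipeline described above can be illustrated end to end. This is a minimal NumPy sketch, not the project's actual kernels or trained codebooks: FWHT-rotate a head vector, quantize each rotated element against a small 8-entry (3-bit) codebook standing in for a trained Lloyd-Max codebook, then apply a norm correction on dequantization.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))  # orthonormal scaling makes fwht its own inverse

# Hypothetical 8-entry codebook for roughly Gaussian data (a trained
# Lloyd-Max codebook would replace this).
CODEBOOK = np.array([-2.15, -1.34, -0.76, -0.25, 0.25, 0.76, 1.34, 2.15])

def quantize(v):
    r = fwht(v)                          # rotation spreads outliers across channels
    scale = np.sqrt(np.mean(r ** 2))     # per-vector RMS scale
    idx = np.abs(r[:, None] / scale - CODEBOOK).argmin(axis=1)  # nearest centroid
    return idx, scale, np.linalg.norm(v)

def dequantize(idx, scale, norm):
    q = CODEBOOK[idx] * scale
    q *= norm / (np.linalg.norm(q) + 1e-12)  # norm correction
    return fwht(q)                           # inverse rotation

rng = np.random.default_rng(0)
v = rng.standard_normal(128)                 # one head_dim=128 vector
idx, s, n = quantize(v)
v_hat = dequantize(idx, s, n)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the orthonormal FWHT preserves norms, the norm correction guarantees the reconstructed vector has exactly the original L2 norm; only the direction carries quantization error.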
Showing 87 experiments
| ID | Title / Hypothesis | Result | Confidence | Reproductions | Metrics |
|---|---|---|---|---|---|
| cexp_55ab69 | Promoting quality-sensitive layers to q8_0 improves PPL while maintaining compression | success | | 1/5 | ppl, prefill_pp4096, decode_tg64, compression_ratio |
| cexp_59b5fc | Training codebooks to lower MSE always improves perplexity | negative | | 1/5 | ppl_50iter, ppl_200iter |
| cexp_5c19ee | Per-channel RMS-based scaling before L2 norm + FWHT reduces turbo3 quantization error on head_dim=128, where channels are anisotropic | success | | 1/5 | ppl, ppl_q8_baseline, ppl_turbo3_no_innerq |
| cexp_601201 | MLX TurboQuant preserves needle-in-a-haystack retrieval | success | | 1/5 | niah_2_5bit_score, niah_2_5bit_total, niah_3_5bit_score, niah_3_5bit_total |
| cexp_6237df | Alpha should vary logarithmically with context length to track the changing optimal operating point | success | | 1/5 | alpha_3bit_2k, alpha_3bit_32k, alpha_2bit_2k, alpha_2bit_32k |
| cexp_697142 | TCQ (512-state bitshift trellis with Viterbi encode, O(1) sliding-window decode) improves KV cache quality over scalar Lloyd-Max | success | | 1/5 | ppl_turbo3_tcq, ppl_turbo2_tcq, ppl_turbo2_scalar, ppl_q8_baseline |
| cexp_69a1fe | Reducing constant-memory addresses from 8 to 4 via a magnitude-only LUT with XOR sign recovery improves decode on pre-M5 Apple Silicon | success | | 1/5 | decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, speedup_vs_baseline |
| cexp_6e68ef | Inlining dequant into the flash attention loop eliminates function-call overhead | failure | | 1/5 | decode_tok_s_8k, vs_ceiling_pct |
| cexp_6fa654 | α = \|\|x\|\|·dot(x,q)/\|\|q\|\|² halves per-element MSE vs the L2-preserving β = \|\|x\|\|/\|\|q\|\| | negative | | 1/5 | ppl_l2_preserving, ppl_mse_optimal |
| cexp_765085 | Different quantization types for K vs V can improve the quality/speed tradeoff | inconclusive | | 1/5 | ppl_turbo4k_q8v, ppl_q8k_turbo3v, ppl_turbo4k_turbo3v, ppl_turbo3k_turbo4v, decode_q8k_turbo3v, decode_turbo4k_q8v |
| cexp_799341 | Sparse-V benefits are not turbo3-specific; they also work on q8_0 | success | | 1/5 | decode_speedup, ppl_change, niah_change |
| cexp_80c0cd | Deferring the norm multiply until after the LUT lookup improves ILP | failure | | 1/5 | decode_tok_s_8k, vs_ceiling_pct |
| cexp_8410e2 | Establish a dequant baseline with a standard 8-entry constant LUT | baseline | | 1/5 | decode_tok_s_8k, decode_ratio_vs_q8, vs_ceiling_pct, ceiling_tok_s |
| cexp_8593ff | Reorder FWHT output by sequency to group similar-frequency components | neutral | | 1/5 | ppl_baseline, ppl_walsh |
| cexp_8aeaf9 | Skipping V dequant for attention weights below 1e-6 has zero quality impact and improves decode speed | success | | 1/5 | decode_speedup_32k, ppl_sparse_v_on, ppl_sparse_v_off, niah_sparse_v, niah_sparse_v_total, niah_q8_0, niah_q8_0_total, skip_rate_512, skip_rate_4k, skip_rate_32k |
| cexp_9180a2 | Cross-lane register transfer via simd_shuffle avoids constant memory entirely | failure | | 1/5 | decode_tok_s_8k, vs_ceiling_pct |
| cexp_926406 | Named register variables with ternary select avoid both the LUT and array indexing | failure | | 1/5 | decode_tok_s_8k, vs_ceiling_pct |
| cexp_92ae80 | Clifford-algebra rotor rotation is faster than dense matrix rotation | success | | 1/5 | speedup_cuda_1k, speedup_cuda_16k, speedup_metal_1k, speedup_metal_65k, parameter_reduction |
| cexp_93a2c9 | Remapping the turbo3 3-bit index to {mag_idx, sign_bit} halves register LUT pressure, improving decode | neutral | | 1/5 | decode_tg64_4k, decode_tg64_32k, decode_tg64_4k_baseline, decode_tg64_32k_baseline, ppl |
| cexp_954f7f | Replacing the fixed TBQ_CHUNK=4096 with a cudaMemGetInfo-based calculation enables better VRAM utilization | success | | 1/5 | ppl_2k_mmap, ppl_2k_chunked, ppl_8k_mmap, ppl_8k_chunked, ppl_32k_mmap, ppl_32k_chunked |
| cexp_96a193 | turbo3 achieves near-q8_0 quality on Apple Silicon at 4.6x compression | success | | 1/5 | compression_ratio, prefill_tok_s, prefill_ratio_vs_q8, decode_tok_s_2k, decode_ratio_vs_q8_2k, decode_tok_s_8k, decode_ratio_vs_q8_8k, decode_tok_s_32k, decode_ratio_vs_q8_32k, perplexity, raw_kv_kurtosis, post_rotation_kurtosis, post_rotation_std, expected_std, std_ratio |
| cexp_991050 | turbo3 enables full 128K context on a 24GB GPU where q8_0 OOMs | success | | 1/5 | prefill_pp131072, decode_tg64_128k, vram_gb |
| cexp_9ceeb7 | Processing queries in batches reduces the S buffer from O(nh_q\*nq\*chunk) to O(nh_q\*q_batch\*chunk) | success | | 1/5 | s_buffer_27b_70k_before_gb, s_buffer_27b_70k_after_mb, ppl_32k |
| cexp_a2f5f5 | KVLinC claims rotation hurts keys: test K-only unrotated | negative | | 1/5 | ppl_both_rotated, ppl_k_unrotated_v_rotated, ppl_neither_rotated |
| cexp_ac6e32 | Moving backtrace from shared to global memory and double-buffering the cost arrays reduces syncthreads in Viterbi encode | success | | 1/5 | decode_dense_baseline, decode_dense_opt, decode_moe_baseline, decode_moe_opt, ppl_baseline, ppl_optimized |
| cexp_afba59 | Skipping entire KV tiles when all QK scores are far below the running max reduces V-dequant work at long context | negative | | 1/5 | decode_baseline_2k, decode_skip_2k, decode_baseline_65k, decode_skip_65k |
| cexp_b43ed6 | Weighting codebook training by the query-norm distribution (Q²) improves downstream KLD by optimizing for attention-weighted distortion | success | | 1/5 | |
| cexp_b7f864 | Pure arithmetic (mul+add) reconstructs centroids without any memory access | failure | | 1/5 | decode_tok_s_8k, vs_ceiling_pct |
| cexp_bbb376 | Skip V dequant for negligible attention weights (exp(score-max) < 1e-6) | success | | 1/5 | decode_tg64_8k_before, decode_tg64_8k_after, decode_tg64_32k, ppl |
| cexp_bd8d18 | turbo3 quality generalizes to dense architectures on Apple Silicon | success | | 1/5 | compression_ratio, perplexity |
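The two per-vector scales compared in cexp_6fa654 are easy to state concretely. A minimal sketch with a crude stand-in quantizer (not the trained turbo codebooks): for a reconstruction q of x, α = dot(x,q)/||q||² is the least-squares-optimal scale (the ||x|| factor in the row's formula appears when q quantizes the unit-normalized x/||x||), while β = ||x||/||q|| preserves the L2 norm.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(128)

# Crude stand-in quantizer: snap to a coarse uniform grid
# (hypothetical; the experiment uses the trained turbo codebooks).
q = np.round(x / np.abs(x).max() * 3) / 3

alpha = np.dot(x, q) / np.dot(q, q)           # MSE-optimal scale for x ≈ s·q
beta = np.linalg.norm(x) / np.linalg.norm(q)  # L2-norm-preserving scale

mse_alpha = np.mean((x - alpha * q) ** 2)
mse_beta = np.mean((x - beta * q) ** 2)
```

By construction α minimizes MSE over all scalar scales, so mse_alpha ≤ mse_beta always holds; the experiment's negative result says this lower reconstruction MSE did not translate into better perplexity than the L2-preserving scale.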