Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
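
For orientation, here is a minimal sketch of the recipe the experiments below iterate on, applied to a single head_dim=128 row: rotate with an orthonormal FWHT, snap each element to a small codebook (Lloyd-Max style), rescale so the dequantized row keeps its original L2 norm, and rotate back. It is illustrative only, not the repo's kernels; the codebook values, the per-row RMS scaling, and the toy data are assumptions.

```cpp
// Minimal sketch of the turbo3-style pipeline (illustrative only, not the repo's
// kernels): orthonormal FWHT rotation -> per-element nearest-centroid (Lloyd-Max
// style) 3-bit quantization -> L2 norm correction -> inverse FWHT.
#include <cmath>
#include <cstdio>
#include <vector>

// In-place orthonormal fast Walsh-Hadamard transform; v.size() must be a power of two.
// With the 1/sqrt(n) scaling the transform is its own inverse.
static void fwht(std::vector<float>& v) {
    const size_t n = v.size();
    for (size_t len = 1; len < n; len <<= 1) {
        for (size_t i = 0; i < n; i += len << 1) {
            for (size_t j = i; j < i + len; ++j) {
                const float a = v[j], b = v[j + len];
                v[j]       = a + b;
                v[j + len] = a - b;
            }
        }
    }
    const float scale = 1.0f / std::sqrt((float) n);
    for (float& x : v) x *= scale;
}

int main() {
    // Illustrative 8-entry (3-bit) codebook in units of the row's RMS; real
    // centroids would come from a Lloyd-Max fit to the rotated K/V distribution.
    const float codebook[8] = {-2.15f, -1.34f, -0.76f, -0.25f, 0.25f, 0.76f, 1.34f, 2.15f};

    std::vector<float> x(128, 0.0f);           // one head_dim=128 K or V row (toy data)
    x[0] = 3.0f; x[5] = -1.5f; x[64] = 0.7f;

    std::vector<float> r = x;
    fwht(r);                                   // rotation spreads outliers across the row

    float rms = 0.0f;
    for (float v : r) rms += v * v;
    rms = std::sqrt(rms / (float) r.size()) + 1e-12f;

    std::vector<float> deq(r.size());
    for (size_t i = 0; i < r.size(); ++i) {    // nearest centroid per element (3 bits each)
        int best = 0;
        for (int c = 1; c < 8; ++c) {
            if (std::fabs(r[i] / rms - codebook[c]) < std::fabs(r[i] / rms - codebook[best])) best = c;
        }
        deq[i] = codebook[best] * rms;
    }

    // Norm correction: rescale so the dequantized row keeps the original L2 norm.
    float n_ref = 0.0f, n_deq = 0.0f;
    for (size_t i = 0; i < r.size(); ++i) { n_ref += r[i] * r[i]; n_deq += deq[i] * deq[i]; }
    const float alpha = std::sqrt(n_ref / (n_deq + 1e-12f));
    for (float& v : deq) v *= alpha;

    fwht(deq);                                 // un-rotate back to the original basis
    printf("x[0]=%.3f deq[0]=%.3f  x[5]=%.3f deq[5]=%.3f\n", x[0], deq[0], x[5], deq[5]);
    return 0;
}
```
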
87 experiments tracked; a subset is shown below.
| ID | Title / Hypothesis | Result | Confidence | Reproductions | Metrics |
|---|---|---|---|---|---|
| cexp_bf9572 | Gemma 4's K=V shared projections interact differently with KV cache quantization due to correlated K/V errors | inconclusive | 1/5 | | kld_gemma4_q8, kld_qwen35_q8 |
| cexp_c05db2 | Processing 2 elements at once with half2 and ternary sign reduces address count | failure | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_c08e7e | Threshold choice across 5 orders of magnitude does not affect quality | neutral | 1/5 | | ppl_1e4, ppl_1e5, ppl_1e6, ppl_1e7, ppl_1e8 |
| cexp_c4ea9d | Flipping the computation to iterate over centroids and accumulate matching Q elements avoids per-element lookup | negative | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_c9ca34 | Storing the first N tokens at fp16 improves PPL (sink tokens receive disproportionate attention) | neutral | 1/5 | | ppl_no_sink, ppl_4_sinks, ppl_8_sinks, ppl_16_sinks |
| cexp_cfe4ca | KV cache operations are a negligible fraction of total decode compute at batch_size=1 | success | 1/5 | | |
| cexp_d4375d | Establish baseline PPL values for TBQ types on Qwen3.5-9B | baseline | 1/5 | | ppl_f16, ppl_q8_0, ppl_tbq4, ppl_tbq3, ppl_tbq2 |
| cexp_d6b923 | turbo3 maintains retrieval accuracy with multiple distractor needles through 32K context | success | 1/5 | | retrieval_pct |
| cexp_dd82bb | A fully branchless FMA chain with zero memory accesses beats a constant LUT | failure | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_e22a96 | Gemma-3 V cache is broken because V un-rotation is missing from the iSWA build_attn overload | success | 1/5 | | ppl_q8_baseline, ppl_turbo3_kv_after_fix, ppl_turbo3_k_only, ppl_turbo3_kv_before_fix |
| cexp_e426cb | Fusing activation quantization into the weight GEMM kernel eliminates the separate q8_1 quantize kernel launch + L2 traffic | negative | 1/5 | | decode_baseline, decode_f32_fma, decode_inreg_dp4a, decode_smem_dp4a |
| cexp_e5416b | turbo3 with FWHT rotation + norm correction matches q8_0 quality | success | 1/5 | | ppl, ppl_q8_baseline, compression_ratio |
| cexp_e66662 | QJL correction in turbo4 is unnecessary overhead | failure | 1/5 | | ppl_with_qjl, ppl_without_qjl, prefill_without_qjl, prefill_with_qjl |
| cexp_e68bfa | Alpha applied at decode time (scaling dequantized V) vs encode time (baked into the fp16 norm before quantization) may give different results, with decode-time scaling enabling context-adaptive deployment | success | 1/5 | | |
| cexp_ec2202 | Better byte extraction with an 8-entry LUT improves over the baseline | failure | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_ee1dd9 | Per-channel scaling before FWHT reduces the head_dim=128 quality gap by aligning channel variances | negative | 1/5 | | theoretical_gain |
| cexp_f0f034 | Subtracting the per-element mean before FWHT improves quantization | negative | 1/5 | | ppl_turbo3_baseline, ppl_turbo3_dc, ppl_turbo4_baseline, ppl_turbo4_dc |
| cexp_f2b4aa | turbo3 KLD tracks bit rate, not implementation quality | neutral | 1/5 | | kld_moe_turbo3, kld_moe_q4_0, kld_moe_q8_0, kld_dense_turbo3, kld_dense_q4_0, kld_dense_q8_0, same_top_p_moe_turbo3, same_top_p_dense_turbo3 |
| cexp_f85c20 | Capturing the decode step as a CUDA Graph eliminates per-kernel launch overhead (hundreds of launches per token) | success | 1/5 | | decode_baseline, decode_cuda_graphs |
| cexp_3e9a2e | Asymmetric q8_0/tbq3 will maintain a <5% prefill gap and decode parity across diverse model architectures | success | 1/5 | | |
| cexp_466df6 | The fused kernel's end-to-end slowdown is amplified/masked by non-attention compute (FFN, norms, embeddings). Comparing f16 KV (zero dequant overhead) vs TBQ3 baseline vs TBQ3 fused isolates the attention-only time and shows what fraction of total time is available for optimization. | success | 1/5 | | |
| cexp_59b1a9 | Since the SRHT (H, diag(r)) is fixed across all tokens, we can pre-rotate Q with the forward SRHT once, compute K scores directly against centroids (no K butterfly), accumulate V in the compressed domain (no V butterfly), and apply one inverse SRHT at the end. This eliminates 14 butterfly stages per KV token from the inner loop, reducing per-token compute by ~8x (see the sketch after the table). | success | 1/5 | | |
| cexp_6d957c | Using absolute query positions (nkv-nq+q_start+q_len) for the causal skip correctly prunes Q batches without affecting PPL, saving 38-47% compute during full prefill (see the sketch after the table) | success | 1/5 | | |
| cexp_70847d | Adding head_dim=256 support to the fused TBQ3 attention kernel enables it to work with Qwen3.5-27B and Gemma-3 models. The kernel needs a per-block SRHT (two independent 128-element butterflies for D=256) and a FLOATS_PER_LANE = D/WARP_SIZE generalization. | success | 1/5 | | |
| cexp_c69dc2 | Fusing TBQ3 dequant (inverse SRHT) directly into a FlashAttention-style online-softmax kernel eliminates all intermediate buffers (k_tmp, v_tmp, S) while producing identical results (see the sketch after the table) | success | 1/5 | | |
| cexp_e7c22d | Bulk-dequanting V to fp16 before the MMA launch (instead of fusing V dequant into the tile loader) will close the pp8192 gap because V can use the standard cp.async.cg pipeline | success | 1/5 | | pp8192_before, pp8192_after |
| cexp_e7e920 | The fused kernel's Q addressing was swapped: it used q_head * (nq * D) instead of q_idx * (nh_q * D). After ggml_permute(0,2,1,3), the Q physical layout is [nq, nh_q, D] (token-major), not [nh_q, nq, D] (head-major). Fixing this should make PPL match the baseline (see the sketch after the table). | success | 1/5 | | |
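
A sketch for cexp_59b1a9. The trick rests on two identities of a fixed orthonormal rotation R = H*diag(r): dot products are preserved, q.k = (Rq).(Rk), and the weighted V sum commutes with the inverse rotation, sum_i p_i*v_i = R^-1(sum_i p_i*(R*v_i)). The toy program below numerically checks both identities, with an FWHT plus random signs standing in for the SRHT; it is not the fused kernel, and the sizes and data are assumptions.

```cpp
// Illustrative check (not the fused kernel) of the identities behind cexp_59b1a9:
// with a fixed orthonormal rotation R, K scores can be computed in the rotated
// domain and V can be accumulated in the rotated domain, with a single inverse
// rotation applied to the final output.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

static void fwht(std::vector<float>& v) {        // orthonormal, self-inverse
    const size_t n = v.size();
    for (size_t len = 1; len < n; len <<= 1)
        for (size_t i = 0; i < n; i += len << 1)
            for (size_t j = i; j < i + len; ++j) {
                const float a = v[j], b = v[j + len];
                v[j] = a + b; v[j + len] = a - b;
            }
    const float s = 1.0f / std::sqrt((float) n);
    for (float& x : v) x *= s;
}

int main() {
    const int D = 128, T = 4;                     // head_dim and a few KV tokens
    std::mt19937 rng(0);
    std::uniform_real_distribution<float> U(-1.0f, 1.0f);
    std::vector<int> sign(D);                     // diag(r) of the SRHT
    for (int d = 0; d < D; ++d) sign[d] = (rng() & 1) ? 1 : -1;

    auto rotate = [&](std::vector<float> v) {     // R v = H (diag(r) v)
        for (int d = 0; d < D; ++d) v[d] *= sign[d];
        fwht(v);
        return v;
    };
    auto unrotate = [&](std::vector<float> v) {   // R^-1 v = diag(r) (H v)
        fwht(v);
        for (int d = 0; d < D; ++d) v[d] *= sign[d];
        return v;
    };

    std::vector<float> q(D);
    for (auto& x : q) x = U(rng);
    std::vector<std::vector<float>> K(T, std::vector<float>(D)), V(T, std::vector<float>(D));
    for (int t = 0; t < T; ++t)
        for (int d = 0; d < D; ++d) { K[t][d] = U(rng); V[t][d] = U(rng); }
    const float p[T] = {0.1f, 0.2f, 0.3f, 0.4f};  // toy attention weights

    std::vector<float> q_rot = rotate(q);         // forward SRHT applied to Q once
    std::vector<float> out_ref(D, 0.0f), acc_rot(D, 0.0f);
    float max_score_err = 0.0f;
    for (int t = 0; t < T; ++t) {
        std::vector<float> k_rot = rotate(K[t]), v_rot = rotate(V[t]); // stands in for stored codes
        float s_ref = 0.0f, s_rot = 0.0f;
        for (int d = 0; d < D; ++d) { s_ref += q[d] * K[t][d]; s_rot += q_rot[d] * k_rot[d]; }
        max_score_err = std::fmax(max_score_err, std::fabs(s_ref - s_rot));
        for (int d = 0; d < D; ++d) { out_ref[d] += p[t] * V[t][d]; acc_rot[d] += p[t] * v_rot[d]; }
    }
    std::vector<float> out_fused = unrotate(acc_rot); // one inverse SRHT at the end
    float max_out_err = 0.0f;
    for (int d = 0; d < D; ++d) max_out_err = std::fmax(max_out_err, std::fabs(out_ref[d] - out_fused[d]));
    printf("max score error %.2e, max output error %.2e\n", max_score_err, max_out_err);
    return 0;
}
```
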
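
A sketch for cexp_c69dc2. The fused kernel's core pattern is the FlashAttention-style online-softmax update: one pass over the KV tokens keeping a running max, a running normalizer, and a running weighted sum of V, so no score matrix or temporary K/V buffers are materialized. The scalar toy below shows only that update with made-up data; in the real kernel the K and V rows would be dequantized from TBQ3 blocks inside the same loop.

```cpp
// Scalar illustration (not the fused kernel) of the online-softmax update used
// in cexp_c69dc2: a single pass over KV with running (max, normalizer, V-sum).
#include <cmath>
#include <cstdio>

int main() {
    const int D = 4, T = 3;                      // toy head_dim and KV length
    const float q[D]    = {0.1f, 0.2f, -0.3f, 0.4f};
    const float K[T][D] = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}};
    const float V[T][D] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};

    float m = -INFINITY;                         // running max of scores
    float l = 0.0f;                              // running sum of exp(score - m)
    float acc[D] = {0, 0, 0, 0};                 // running weighted sum of V rows

    for (int t = 0; t < T; ++t) {
        float s = 0.0f;                          // real kernel: dequantize K[t] here
        for (int d = 0; d < D; ++d) s += q[d] * K[t][d];
        const float m_new = std::fmax(m, s);
        const float corr  = std::exp(m - m_new); // rescales everything seen so far
        const float p     = std::exp(s - m_new);
        l = l * corr + p;
        for (int d = 0; d < D; ++d)              // real kernel: dequantize V[t] here
            acc[d] = acc[d] * corr + p * V[t][d];
        m = m_new;
    }
    for (int d = 0; d < D; ++d) printf("%.4f ", acc[d] / l); // equals softmax(q.K^T) V
    printf("\n");
    return 0;
}
```
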
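
A sketch for cexp_6d957c. A hedged reading of the hypothesis (the kernel's actual tile shapes are not shown in the table) is that the batch's queries occupy absolute positions ending just before nkv - nq + q_start + q_len, so a Q-batch/KV-tile pair whose KV tile starts past the batch's last query is fully masked by causality and can be skipped:

```cpp
// Assumed semantics of the causal skip in cexp_6d957c (illustrative, not the kernel):
// queries in this batch occupy absolute positions [nkv - nq + q_start,
// nkv - nq + q_start + q_len); a KV tile that starts past the batch's last
// query position is entirely masked by causality, so the pair can be skipped.
static inline bool causal_skip(int nkv, int nq, int q_start, int q_len, int kv_tile_start) {
    const int last_q_abs = nkv - nq + q_start + q_len - 1;  // last query's absolute position
    return kv_tile_start > last_q_abs;                      // whole tile is in the future
}
```
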
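
A sketch for cexp_e7e920. The fix follows directly from the token-major layout left by ggml_permute(0,2,1,3); the helper below is hypothetical and only spells out the corrected offset arithmetic:

```cpp
#include <cstddef>

// After ggml_permute(0,2,1,3) the Q tensor is laid out token-major, [nq, nh_q, D],
// so the element offset is q_idx * (nh_q * D) + q_head * D + d rather than the
// head-major q_head * (nq * D) + q_idx * D + d that the fused kernel used.
static inline size_t q_offset(int q_idx, int q_head, int d, int nh_q, int D) {
    return (size_t) q_idx * nh_q * D + (size_t) q_head * D + (size_t) d;
}
```
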