Lloyd-Max codebook quantization for LLM KV caches. 3-bit (turbo3) and 4-bit (turbo4) with FWHT rotation and norm correction. Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
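The summary names the core ingredients without showing them, so here is a minimal sketch of two of the three: the orthonormal FWHT rotation and a Lloyd-Max codebook fit to a unit Gaussian (the norm-correction step is sketched after the experiment table). This illustrates the general technique only; `fwht` and `lloyd_max_gaussian` are names invented for the example, not the repo's kernels or API.

```cpp
// Sketch of two ingredients named above: an orthonormal FWHT rotation
// and a Lloyd-Max codebook fit to N(0,1). Illustrative only.
#include <cmath>
#include <cstdio>
#include <vector>

// In-place Fast Walsh-Hadamard Transform; n must be a power of two.
// The 1/sqrt(n) factor makes the transform orthonormal, so L2 norms
// are preserved while per-component values become closer to Gaussian,
// which is what lets a single scalar codebook fit every dimension.
void fwht(float *v, int n) {
    for (int h = 1; h < n; h <<= 1)
        for (int i = 0; i < n; i += 2 * h)
            for (int j = i; j < i + h; ++j) {
                const float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
    const float s = 1.0f / std::sqrt((float) n);
    for (int i = 0; i < n; ++i) v[i] *= s;
}

// Lloyd-Max for a unit Gaussian: alternate midpoint thresholds with
// the conditional mean of N(0,1) on each quantization cell,
//   E[X | a < X < b] = (phi(a) - phi(b)) / (Phi(b) - Phi(a)).
std::vector<float> lloyd_max_gaussian(int levels, int iters = 500) {
    const double PI = 3.14159265358979323846;
    auto phi = [&](double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); };
    auto Phi = [](double x)  { return 0.5 * (1.0 + std::erf(x / std::sqrt(2.0))); };
    std::vector<double> c(levels);
    for (int i = 0; i < levels; ++i)           // crude uniform init
        c[i] = -3.0 + 6.0 * (i + 0.5) / levels;
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < levels; ++i) {
            const double lo = i == 0          ? -12.0 : 0.5 * (c[i - 1] + c[i]);
            const double hi = i == levels - 1 ?  12.0 : 0.5 * (c[i] + c[i + 1]);
            const double mass = Phi(hi) - Phi(lo);
            if (mass > 1e-12) c[i] = (phi(lo) - phi(hi)) / mass;
        }
    }
    return std::vector<float>(c.begin(), c.end());
}

int main() {
    // 3-bit codebook: 8 centroids, symmetric about zero at convergence.
    for (float c : lloyd_max_gaussian(8)) std::printf("%+.4f ", c);
    std::printf("\n");
    // A unit impulse spreads to +1/sqrt(8) in every slot under FWHT.
    float v[8] = {1, 0, 0, 0, 0, 0, 0, 0};
    fwht(v, 8);
    for (float x : v) std::printf("%+.4f ", x);
    std::printf("\n");
}
```

The 3-bit case converges to the classic 8-level Gaussian Lloyd-Max grid (roughly ±0.245, ±0.756, ±1.344, ±2.152); at 4 bits the same loop simply runs with 16 levels.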
87 experiments:
| ID | Title / Hypothesis | Result | Confidence | Reproductions | Metrics |
|---|---|---|---|---|---|
| cexp_015c90 | turbo4 K produces garbage because Q pre-rotation guard only checks TURBO3_0, not TURBO4_0 | success | 1/5 | | ppl_turbo4_kv_qwen35_27b_fixed, ppl_turbo4_kv_qwen3_14b_fixed, ppl_turbo4_kv_qwen3_14b_broken, ppl_turbo4v_qwen3_14b |
| cexp_01f12d | turbo3 at 3.5 bpv competes with upstream rotated q4_0 at 4.5 bpv | success | 1/5 | | ppl_f16_baseline, ppl_turbo3_2k, ppl_q4_0_rot_2k |
| cexp_028768 | Forcing non-vectorized FA kernel (nl=2) improves single-token decode | failure | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_09836a | Reading turbo3 directly in VEC attention kernel (scalar dequant, no fp16 buffer) saves 5x bandwidth by avoiding fp16 materialization | negative | 1/5 | | decode_fp16_mma, decode_native_vec |
| cexp_0b0172 | turbo dequant-to-fp16 kernels ignore stream dimension ne[3], causing catastrophic PPL with n_seq > 1 | success | 1/5 | | ppl_nseq1_before, ppl_nseq1_after, ppl_nseq2_before, ppl_nseq2_after, ppl_nseq4_before, ppl_nseq4_after, ppl_2k_8c_la1, ppl_2k_8c_q8, ppl_4k_4c_la1, ppl_4k_4c_q8, ppl_8k_4c_la1, ppl_8k_4c_q8, ppl_8k_4c_uniform |
| cexp_10bd78 | Accepting ~1% PPL regression from fp16 round-trip enables 1.9x prefill speedup for turbo4 | success | 1/5 | | prefill_pp4096_before, prefill_pp4096_after, ppl_full_precision, ppl_fp16_prefill, decode_tg64 |
| cexp_117f61 | Auto-detect (max_scale_ratio < 1.2 → disable) prevents InnerQ from hurting well-balanced head_dim=256 distributions | success | 1/5 | | ppl, ppl_turbo3_baseline, ppl_innerq_forced, max_scale_ratio_detected |
| cexp_1828c2 | Codebooks worse at short context become better at long context due to CLT averaging in attention | success | 1/5 | | ppl_compiled_2k, ppl_finetuned_2k, ppl_compiled_32k, ppl_finetuned_32k |
| cexp_1c3f17 | Loading 2KB TCQ codebook from __constant__ memory into shared memory gives 32-bank parallel access, improving decode throughput | negative | 1/5 | | decode_constant_2k, decode_smem_2k, decode_constant_32k, decode_smem_32k |
| cexp_1ff7e8 | Chunked cuBLAS GEMM for Q*K and score*V during prefill outperforms fused flash attention MMA | negative | 1/5 | | prefill_fused_mma |
| cexp_22b2b2 | Greedy trellis encode (locally optimal at each step) trades quality for prefill speed | negative | 1/5 | | ppl_greedy, ppl_multi_start_512, ppl_viterbi |
| cexp_239fdb | TCQ trellis structure induces autocorrelation in quantization errors that could be exploited or must be modeled | negative | 1/5 | | lag1_autocorrelation, lag2_autocorrelation, lag4_autocorrelation |
| cexp_283634 | 2-bit PolarQuant (4-centroid Lloyd-Max) provides maximum compression for VRAM-constrained scenarios | inconclusive | 1/5 | | ppl_turbo2_uniform, ppl_turbo3k_turbo2v, ppl_turbo2k_turbo3v, ppl_turbo2k_q8v, ppl_turbo2_la1, ppl_turbo2_la2, kv_memory_4k_mib, compression_vs_fp16, decode_tg128 |
| cexp_28ca71 | turbo3 on draft model KV saves VRAM and maintains acceptance rate | neutral | 1/5 | | throughput_q8_draft, throughput_turbo3_draft, n_drafted_q8, n_drafted_turbo3, normal_decode |
| cexp_2a99b4 | CUDA implementation on Blackwell achieves near-parity with F16 decode | success | 1/5 | | decode_tok_s_f16, decode_tok_s_q8_0, decode_tok_s_turbo3, decode_ratio_vs_f16, compression_ratio |
| cexp_2db704 | PPL-optimal alpha may not be KLD-optimal; the two metrics may disagree on the best operating point | success | 1/5 | | ppl_optimal_alpha, kld_optimal_alpha |
| cexp_2fe356 | Bulk dequant turbo KV to fp16 then use MMA tensor core kernel for prefill | success | 1/5 | | prefill_pp4096_before, prefill_pp4096_after, prefill_ratio_vs_q8, decode_tg64, ppl |
| cexp_356f07 | Scaling V norm by constant alpha after quantization compensates for systematic shrinkage and improves quality (see the norm-correction sketch after the table) | success | 1/5 | | |
| cexp_3683be | ncu profiling reveals whether MMVQ weight GEMM kernel has optimization headroom | success | 1/5 | | registers_per_thread, active_warps |
| cexp_39282e | turbo3 quality generalizes across architectures | negative | 1/5 | | |
| cexp_392e0e | Promoting last 8 layers to q8_0 improves PPL while maintaining most turbo3 compression | success | 1/5 | | ppl_la2_turbo3, ppl_la2_turbo4, decode_tg64_la2_turbo3, decode_tg64_la2_turbo4, compression_ratio |
| cexp_3c4375 | Multi-GPU performance matches single-GPU ratios | inconclusive | 1/5 | | prefill_turbo3, prefill_q8, decode_turbo3, decode_q8, prefill_turbo4, decode_turbo4 |
| cexp_3e59be | turbo4 (3-bit + QJL correction) beats q8_0 on head_dim=256 | success | 1/5 | | ppl, ppl_q8_baseline, compression_ratio |
| cexp_3eeb92 | Madreag CUDA fork maintains decode performance at long context | negative | 1/5 | | decode_tg128_0k_q8, decode_tg128_0k_turbo3, decode_ratio_0k, decode_tg128_4k_q8, decode_tg128_4k_turbo3, decode_ratio_4k, decode_tg128_8k_q8, decode_tg128_8k_turbo3, decode_ratio_8k, decode_tg128_16k_q8, decode_tg128_16k_turbo3, decode_ratio_16k, decode_tg128_204k_q8, decode_tg128_204k_turbo3, decode_ratio_204k, prefill_ratio_0k, prefill_ratio_16k, prefill_ratio_204k, kv_mem_q8_total_mib, kv_mem_turbo3_total_mib, kv_compression_ratio |
| cexp_42fdf4 | Measure theoretical maximum decode speed with dequant disabled (returns zeros) | baseline | 1/5 | | decode_tok_s_8k, decode_ratio_vs_q8 |
| cexp_48061c | TurboQuant generalizes to vLLM inference framework on multi-GPU CUDA | success | 1/5 | | kv_savings_pct, niah_score, niah_total, prefill_tok_s_min, prefill_tok_s_max, decode_tok_s_short, decode_tok_s_131k, kv_cache_mb_tq_131k, kv_cache_mb_baseline_131k, full_attn_layers_compression, cosine_sim_3bit_keys, cosine_sim_2bit_values, cosine_sim_4bit_values |
| cexp_4d9a59 | Replacing all LUT reads with a select/ternary chain eliminates constant memory entirely | failure | 1/5 | | decode_tok_s_8k, vs_ceiling_pct |
| cexp_4f2e9a | MSE-only approach (no QJL) at 2-bit produces character-identical output to fp16 | success | 1/5 | | |
| cexp_53ebee | Tuning split-K parallel_blocks parameter improves turbo decode throughput | neutral | 1/5 | | decode_pb_auto, decode_pb1, decode_pb2, decode_pb4, decode_pb8, decode_pb16, decode_pb32, decode_q8_baseline |
| cexp_558c28 | FWHT rotation is essential for turbo3 quality | conflict | 1/5 | | ppl_with_rotation, ppl_without_rotation_with_norm, ppl_without_rotation_without_norm, ppl_q8_baseline, ppl_la3_last4, ppl_la4_first4, ppl_la5_first2_last2, compression_ratio_4layers, ppl_mode6_vonly_last8, ppl_mode7_konly_last8, ppl_mode8_vonly_2plus2, ppl_mode2_both_last8 |
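As flagged in the cexp_356f07 row, here is a minimal sketch of the norm-correction step in its exact per-vector form; the experiment itself tests a cheaper constant-alpha variant for V. It assumes the unit-variance codebook from the `lloyd_max_gaussian` sketch above, and `quantize_dequantize` is a name invented for this example, not the repo's API.

```cpp
// Minimal sketch of per-vector norm correction; assumes a codebook fit
// to N(0,1) as in the earlier sketch. Illustrative only; a real kernel
// would store packed 3- or 4-bit indices plus a per-vector scale
// instead of materializing floats.
#include <cmath>
#include <vector>

// Quantize one (already FWHT-rotated) vector against the codebook,
// then dequantize and rescale so the reconstruction keeps the input's
// L2 norm. Nearest-centroid rounding pulls values toward the centroid
// mass, which systematically shrinks vectors; the final alpha undoes
// that shrinkage.
void quantize_dequantize(const float *x, float *out, int n,
                         const std::vector<float> &codebook) {
    // Per-vector scale: after an orthonormal rotation the components
    // are roughly N(0, ||x||^2 / n), so dividing by this puts them on
    // the unit-variance grid the codebook was trained for.
    double ss = 0.0;
    for (int i = 0; i < n; ++i) ss += (double) x[i] * x[i];
    const float norm  = (float) std::sqrt(ss);
    const float scale = norm > 0.0f ? norm / std::sqrt((float) n) : 1.0f;

    double rss = 0.0;
    for (int i = 0; i < n; ++i) {
        const float t = x[i] / scale;
        int best = 0;
        float best_d = INFINITY;             // nearest-centroid search
        for (int k = 0; k < (int) codebook.size(); ++k) {
            const float d = (t - codebook[k]) * (t - codebook[k]);
            if (d < best_d) { best_d = d; best = k; }
        }
        out[i] = codebook[best] * scale;     // a real kernel stores 'best'
        rss += (double) out[i] * out[i];
    }

    // Norm correction: restore the original L2 norm exactly.
    if (rss > 0.0) {
        const float alpha = norm / (float) std::sqrt(rss);
        for (int i = 0; i < n; ++i) out[i] *= alpha;
    }
}
```

The constant-alpha variant replaces the exact per-vector `alpha` with one scalar tuned offline, dropping the second reduction pass; presumably that same alpha is the operating point that cexp_2db704 sweeps when asking whether PPL and KLD agree on its best value.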