Lloyd-Max codebook quantization for LLM KV caches: 3-bit (turbo3) and 4-bit (turbo4) variants with FWHT rotation and norm correction (a minimal sketch of the pipeline follows the project details below). Beats q8_0 quality at 3-5x compression. Research focus: closing the head_dim=128 quality gap, decode speed on MoE models, and exploring CAT/SQuat/InnerQ techniques.
| Owner | buun |
| GPU | RTX 3090 (24 GB VRAM) |
| Model | claude-opus-4-6 |
| Created | 1mo ago |
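
For orientation, a minimal Python sketch of the pipeline named above, under stated assumptions: the FWHT rotation spreads energy so post-rotation values are roughly Gaussian (where a Lloyd-Max scalar codebook is near-optimal), and a final rescale preserves each vector's L2 norm. The function names, per-vector scale, and training loop are illustrative, not the project's actual CUDA kernels.

```python
# Minimal sketch (assumed structure, not the real kernels): FWHT rotation,
# Lloyd-Max scalar codebook, L2-norm-preserving correction.
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform over the last axis.
    Length must be a power of two (e.g. head_dim = 128 or 256);
    the orthonormal FWHT is its own inverse."""
    y = x.astype(np.float64).copy()
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def train_lloyd_max(samples: np.ndarray, bits: int = 3, iters: int = 30) -> np.ndarray:
    """Lloyd's algorithm: alternate nearest-level assignment and centroid
    update to get an MSE-optimal scalar codebook with 2**bits levels."""
    levels = np.quantile(samples, np.linspace(0.005, 0.995, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(levels.size):
            sel = samples[idx == j]
            if sel.size:
                levels[j] = sel.mean()
    return np.sort(levels)

def encode(v: np.ndarray, levels: np.ndarray):
    """Rotate, quantize against the codebook, and fold a norm correction
    into the stored scale so dequantization preserves ||v||."""
    r = fwht(v)
    s = r.std() + 1e-12                      # per-vector scale (assumption)
    idx = np.abs(r[:, None] / s - levels[None, :]).argmin(axis=1)
    deq = levels[idx] * s
    alpha = np.linalg.norm(r) / (np.linalg.norm(deq) + 1e-12)
    return idx, s * alpha                    # store indices + one fp scale

# Toy usage: post-FWHT KV values are near-Gaussian, so train on normals.
rng = np.random.default_rng(0)
codebook = train_lloyd_max(rng.standard_normal(65536), bits=3)
idx, scale = encode(rng.standard_normal(256), codebook)
v_hat = fwht(codebook[idx] * scale)          # inverse rotation = same FWHT
```

EXP-0005 (rotation ablation) and EXP-0012 (L2-preserving vs MSE-optimal correction) in the table below measure the contribution of each stage.
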
| ID | Title | Result | Metrics (ppl = perplexity, kld = KL divergence; prefill/decode figures are tok/s) | Date |
|---|---|---|---|---|
| EXP-0046 | Context-adaptive decode-time V alpha | success | improvement_vs_fixed_8k: 2-7%; improvement_vs_fixed_32k: 2-7%; runtime_cost: zero; alpha_3bit_2k: 1.022; alpha_3bit_32k: 1.002; alpha_2bit_2k: 1.039; alpha_2bit_32k: 1.094 | 1mo ago |
| EXP-0047 | Weight GEMM dominance — attention fraction analysis | success | flash_attn_pct: 0.6%; mmvq_pct: 83.6%; mmvq_bw_utilization: 88-94%; kv_quant_speed_impact: 4-5%; weight_quant_speed_impact: 25% | 1mo ago |
| EXP-0050 | Skip Softmax — tile-level attention skipping (speed) | negative | decode_baseline_2k: 29.9; decode_skip_2k: 29.9; decode_baseline_65k: 29.8; decode_skip_65k: 29.8; attn_fraction_32k: 0.6% | 1mo ago |
| EXP-0052 | Fused weight quantize + GEMM (speed) | negative | decode_baseline: 30.0; decode_f32_fma: 24.82; decode_inreg_dp4a: 20.57; decode_smem_dp4a: 8.7 | 1mo ago |
| EXP-0057 | MMVQ kernel profiling — hardware bandwidth wall | success | bw_utilization_large: 88-94%; bw_utilization_small: 50%; registers_per_thread: 40; active_warps: 46; bottleneck: DRAM bandwidth | 1mo ago |
| EXP-0049 | TCQ codebook → shared memory (speed) | negative | decode_constant_2k: 29.9; decode_smem_2k: 29.9; decode_constant_32k: 29.7; decode_smem_32k: 29.7 | 1mo ago |
| EXP-0051 | Greedy TCQ encode (speed) | negative | prefill_greedy_delta: +8%; ppl_greedy: 17.09; ppl_multi_start_512: 14.74; ppl_viterbi: 5.83 | 1mo ago |
| EXP-0053 | CUDA Graphs for decode pipeline (speed) | success | decode_baseline: 29.9; decode_cuda_graphs: 30.83 | 1mo ago |
| EXP-0054 | Viterbi double-buffered cost + global backtrace (speed) | success | decode_dense_baseline: 29.56; decode_dense_opt: 29.71; decode_moe_baseline: 126.22; decode_moe_opt: 126.97; ppl_baseline: 6.2186; ppl_optimized: 6.2186; smem_reduction: 35KB→5KB per block | 1mo ago |
| EXP-0055 | Native VEC decode (scalar dequant in attention kernel) | negative | decode_fp16_mma: 30.0; decode_native_vec: 29.7 | 1mo ago |
| EXP-0056 | cuBLAS GEMM for flash attention prefill | negative | prefill_fused_mma: 1125; prefill_cublas: 1-5% slower | 1mo ago |
| EXP-0045 | Gemma 4 architecture — K=V quantization sensitivity | unknown | ppl_turbo3_k_only_delta: -1.7%; ppl_turbo3_v_only_delta: +70%; kld_gemma4_q8: 0.509; kld_qwen35_q8: 0.005; kld_ratio: 110x | 1mo ago |
| EXP-0048 | Encode-time vs decode-time alpha | success | kld_encode_2k: better; kld_decode_8k: better (-3.9% vs encode) | 1mo ago |
| EXP-0042 | TCQ error autocorrelation | negative | lag1_autocorrelation: -0.007; lag2_autocorrelation: 0.003; lag4_autocorrelation: -0.001 | 1mo ago |
| EXP-0043 | PPL vs KLD divergence — alpha optimization | success | ppl_optimal_alpha: 1.2; kld_optimal_alpha: 1.04 | 1mo ago |
| EXP-0044 | Product-aware codebook training (Q²-weighted GLA) | success | kld_improvement_3bit_2k: 7.2%; kld_improvement_3bit_8k: 9.8%; kld_improvement_2bit_2k: 12.8% | 1mo ago |
| EXP-0041 | V norm scaling (temperature scaling) | success | ppl_improvement_64k: -11.8%; v_contribution_ratio: 6.5x vs K scaling | 1mo ago |
| EXP-0038 | TCQ codebook GLA optimization — MSE-PPL divergence | negative | mse_reduction_50iter: 52.8%; mse_reduction_200iter: 54.7%; ppl_50iter: 5.83; ppl_200iter: 5.91; ppl_50iter_delta: +0.13%; ppl_200iter_delta: +1.47% | 1mo ago |
| EXP-0039 | TurboQuant vs rotated q4_0/q8_0 (upstream PR #21038) | success | ppl_f16_baseline: 5.8048; ppl_turbo3_2k: 5.8501; ppl_q4_0_rot_2k: 5.8578; ppl_turbo3_65k_delta: +0.53%; ppl_q4_0_rot_65k_delta: +1.73% | 1mo ago |
| EXP-0040 | Context-length crossover for TCQ codebooks | success | ppl_compiled_2k: 5.827; ppl_finetuned_2k: 5.841; ppl_compiled_32k: 7.098; ppl_finetuned_32k: 7.053; crossover_context: ~8K | 1mo ago |
| EXP-0035 | InnerQ per-channel equalization (head_dim=128) (see sketch below the table) | success | ppl: 6.5349; ppl_q8_baseline: 6.4206; ppl_turbo3_no_innerq: 6.634; gap_closure: 46% | 1mo ago |
| EXP-0036 | InnerQ auto-detect on head_dim=256 | success | ppl: 5.8501; ppl_turbo3_baseline: 5.8501; ppl_innerq_forced: 5.9283; max_scale_ratio_detected: 1.164 | 1mo ago |
| EXP-0037 | Trellis-Coded Quantization for KV cache (see Viterbi sketch below the table) | success | ppl_turbo3_tcq: 5.827; ppl_turbo2_tcq: 6.055; ppl_turbo2_scalar: 15.61; ppl_q8_baseline: 5.838; prefill_delta: -21%; decode_delta: -5% | 1mo ago |
| EXP-0002 | turbo4 baseline quality (head_dim=256) | success | ppl: 5.8186; ppl_q8_baseline: 5.8375; compression_ratio: 4.25 | 1mo ago |
| EXP-0004 | turbo3 quality on head_dim=128 models | negative | ppl_delta_qwen35_27b_hd256: +0.2%; ppl_delta_qwen35_35b_moe_hd256: +0.3%; ppl_delta_mn_violet_12b_hd128: +2.6%; ppl_delta_qwen3_14b_hd128: +3.8%; ppl_delta_gemma3_27b_hd128: +3.3% | 1mo ago |
| EXP-0007 | Sparse V dequant (credit: TheTom) | success | decode_tg64_8k_before: 114.44; decode_tg64_8k_after: 126.89; decode_tg64_32k: 126.21; ppl: 5.8501 | 1mo ago |
| EXP-0008 | GSR Walsh ordering | neutral | ppl_baseline: 5.8323; ppl_walsh: 5.8248 | 1mo ago |
| EXP-0009 | KVLinC asymmetric K/V rotation | negative | ppl_both_rotated: 5.8323; ppl_k_unrotated_v_rotated: 6.1647; ppl_neither_rotated: 6.2357 | 1mo ago |
| EXP-0010 | Attention sink token protection | neutral | ppl_no_sink: 5.8501; ppl_4_sinks: 5.8246; ppl_8_sinks: 5.8506; ppl_16_sinks: 5.8894 | 1mo ago |
| EXP-0011 | NSNQuant per-token DC removal | negative | ppl_turbo3_baseline: 5.8501; ppl_turbo3_dc: 5.8827; ppl_turbo4_baseline: 5.8186; ppl_turbo4_dc: 17.4134 | 1mo ago |
| EXP-0012 | MSE-optimal norm correction | negative | ppl_l2_preserving: 5.8501; ppl_mse_optimal: 5.9083 | 1mo ago |
| EXP-0014 | Drop QJL from turbo4 | failure | ppl_with_qjl: 5.8186; ppl_without_qjl: 5.8501; prefill_without_qjl: 1124; prefill_with_qjl: 588 | 1mo ago |
| EXP-0015 | Channel reordering before FWHT | neutral | | 1mo ago |
| EXP-0016 | ButterflyQuant learnable rotation | neutral | | 1mo ago |
| EXP-0017 | AQUA-KV inter-layer prediction | neutral | | 1mo ago |
| EXP-0018 | PatternKV pattern subtraction | neutral | | 1mo ago |
| EXP-0019 | Dual RTX 4090 multi-GPU validation | inconclusive | prefill_turbo3: 4987; prefill_q8: 5117; decode_turbo3: 36.27; decode_q8: 103.57; prefill_turbo4: 2542; decode_turbo4: 17.63 | 1mo ago |
| EXP-0024 | turbo2 (2-bit) variant | unknown | ppl_turbo2_uniform: 6.7792; ppl_turbo3k_turbo2v: 6.567; ppl_turbo2k_turbo3v: 6.5203; ppl_turbo2k_q8v: 6.4894; ppl_turbo2_la1: 6.7411; ppl_turbo2_la2: 6.6866; kv_memory_4k_mib: 20; compression_vs_fp16: 6.4; decode_tg128: 30.64 | 1mo ago |
| EXP-0025 | Gemma-3 SWA V cache bug fix | success | ppl_q8_baseline: 5.6995; ppl_turbo3_kv_after_fix: 5.8867; ppl_turbo3_k_only: 5.9633; ppl_turbo3_kv_before_fix: 4.5e13 | 1mo ago |
| EXP-0026 | turbo4 K Q pre-rotation bug fix | success | ppl_turbo4_kv_qwen35_27b_fixed: 5.8186; ppl_turbo4_kv_qwen3_14b_fixed: 6.9118; ppl_turbo4_kv_qwen3_14b_broken: 32643; ppl_turbo4v_qwen3_14b: 6.6232 | 1mo ago |
| EXP-0027 | parallel_blocks tuning for turbo decode | neutral | decode_pb_auto: 29.95; decode_pb1: 29.97; decode_pb2: 29.95; decode_pb4: 29.95; decode_pb8: 29.96; decode_pb16: 29.96; decode_pb32: 29.93; decode_q8_baseline: 30.81 | 1mo ago |
| EXP-0028 | CAT alignment correction analysis | negative | theoretical_gain: 0 | 1mo ago |
| EXP-0029 | Multi-sequence (n_seq > 1) dequant fix | success | ppl_nseq1_before: 6.31; ppl_nseq1_after: 6.31; ppl_nseq2_before: 17.1; ppl_nseq2_after: 6.3; ppl_nseq4_before: 22.56; ppl_nseq4_after: 6.34 | 1mo ago |
| EXP-0030 | turbo4 prefill MMA (fp16 dequant tradeoff) | success | prefill_pp4096_before: 588; prefill_pp4096_after: 1113; ppl_full_precision: 5.8186; ppl_fp16_prefill: 5.8966; decode_tg64: 29.66 | 1mo ago |
| EXP-0031 | Speculative decoding with turbo KV | neutral | throughput_q8_draft: 28.78; throughput_turbo3_draft: 28.85; n_drafted_q8: 1864; n_drafted_turbo3: 1936; normal_decode: 31 | 1mo ago |
| EXP-0032 | Sign+magnitude encoding for turbo3 dequant | neutral | decode_tg64_4k: 30.05; decode_tg64_32k: 29.91; decode_tg64_4k_baseline: 30.04; decode_tg64_32k_baseline: 29.83; ppl: 5.8501 | 1mo ago |
| EXP-0033 | Long-context PPL validation | success | ppl_2k_8c_la1: 5.769; ppl_2k_8c_q8: 5.8375; ppl_4k_4c_la1: 6.3198; ppl_4k_4c_q8: 6.2677; ppl_8k_4c_la1: 7.3952; ppl_8k_4c_q8: 7.4241; ppl_8k_4c_uniform: 7.3783 | 1mo ago |
| EXP-0034 | SQuat query-orthogonal codebook (dropped) | neutral | | 1mo ago |
| EXP-0001 | turbo3 baseline quality (head_dim=256) | success | ppl: 5.8323; ppl_q8_baseline: 5.8375; compression_ratio: 4.9 | 1mo ago |
| EXP-0003 | Layer-adaptive turbo3 (LA-1, first4+last4 q8_0) | success | ppl: 5.769; prefill_pp4096: 1128; decode_tg64: 30.25; compression_ratio: 3.5 | 1mo ago |
| EXP-0005 | FWHT rotation ablation | success | ppl_with_rotation: 5.8323; ppl_without_rotation_with_norm: 6.2357; ppl_without_rotation_without_norm: 6.5249; ppl_q8_baseline: 5.8375 | 1mo ago |
| EXP-0006 | Prefill dequant+MMA optimization | success | prefill_pp4096_before: 631; prefill_pp4096_after: 1121; prefill_ratio_vs_q8: 0.988; decode_tg64: 30.1; ppl: 5.8501 | 1mo ago |
| EXP-0013 | 128K context on 24GB GPU | success | prefill_pp131072: 669; decode_tg64_128k: 29.85; vram_gb: 23.5; q8_0_fits: false | 1mo ago |
| EXP-0020 | Layer-adaptive mode 2 (last 8 layers q8_0) | success | ppl_la2_turbo3: 5.814; ppl_la2_turbo4: 5.8077; decode_tg64_la2_turbo3: 29.98; decode_tg64_la2_turbo4: 29.69; compression_ratio: 3.5 | 1mo ago |
| EXP-0021 | Layer-adaptive modes 3, 4, 5 isolation tests | success | ppl_la3_last4: 5.8091; ppl_la4_first4: 5.8211; ppl_la5_first2_last2: 5.8091; compression_ratio_4layers: 4.2 | 1mo ago |
| EXP-0022 | Asymmetric K/V type combinations | unknown | ppl_turbo4k_q8v: 5.8451; ppl_q8k_turbo3v: 5.8451; ppl_turbo4k_turbo3v: 5.8653; ppl_turbo3k_turbo4v: 5.8212; decode_q8k_turbo3v: 30.32; decode_turbo4k_q8v: 30.15 | 1mo ago |
| EXP-0023 | Layer-adaptive + asymmetric combined | negative | ppl_mode6_vonly_last8: 5.839; ppl_mode7_konly_last8: 5.839; ppl_mode8_vonly_2plus2: 5.833; ppl_mode2_both_last8: 5.814 | 1mo ago |
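
EXP-0037, EXP-0051, and EXP-0054 revolve around trellis-coded quantization: each sample's admissible codebook subset depends on a trellis state, so a greedy per-sample encoder collapses (ppl 17.09 vs 5.83 for Viterbi in EXP-0051). Below is a hypothetical sketch of the Viterbi encode with a global backtrace (cf. EXP-0054), using an illustrative 4-state shift-register trellis and even/odd set partitioning; the project's actual trellis, state count, and partitioning are not specified here.

```python
# Hypothetical TCQ Viterbi encoder: 4-state trellis, codebook split into
# even/odd subsets (Ungerboeck-style set partitioning). Illustrative only.
import numpy as np

NEXT_STATE = [[0, 2], [0, 2], [1, 3], [1, 3]]  # next_state[state][branch_bit]

def viterbi_encode(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Choose one level per sample so the whole path through the trellis
    minimizes total squared error. A greedy per-sample choice cannot do
    this because the branch bit constrains which subsets are reachable
    for all future samples."""
    subsets = (levels[0::2], levels[1::2])       # set partitioning
    n, S = len(x), len(NEXT_STATE)
    cost = np.full(S, np.inf)
    cost[0] = 0.0                                # fixed start state
    back = np.zeros((n, S, 2), dtype=np.int64)   # (prev_state, level_index)
    for t in range(n):
        new_cost = np.full(S, np.inf)
        for s in range(S):
            if not np.isfinite(cost[s]):
                continue
            for b in (0, 1):                     # branch bit selects a subset
                j = int(np.abs(subsets[b] - x[t]).argmin())
                c = cost[s] + (subsets[b][j] - x[t]) ** 2
                ns = NEXT_STATE[s][b]
                if c < new_cost[ns]:
                    new_cost[ns] = c
                    back[t, ns] = (s, 2 * j + b) # global codebook index
        cost = new_cost
    s = int(cost.argmin())                       # global backtrace
    idx = np.zeros(n, dtype=np.int64)
    for t in range(n - 1, -1, -1):
        s, idx[t] = back[t, s]
    return idx                                   # dequant is levels[idx]
```

Only the encoder searches the trellis; dequantization stays a plain table lookup (levels[idx]), which is why encode strategy shows up in prefill speed (EXP-0051's +8% greedy prefill, EXP-0037's -21%) rather than in decode.
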
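EXP-0035 and EXP-0036 concern per-channel equalization inside the head dimension: head_dim=128 models show imbalanced channel magnitudes that a single shared codebook handles poorly, while head_dim=256 models do not (measured max scale ratio 1.164, so auto-detect leaves them alone). A sketch under assumptions follows; the RMS-based scale definition and the detection threshold are illustrative, not the project's values.

```python
# Illustrative InnerQ-style per-channel equalization (EXP-0035/0036).
# The scale definition and auto-detect threshold are assumptions.
import numpy as np

def innerq_scales(calib: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-channel RMS over a (tokens, head_dim) calibration slice,
    normalized so the mean scale is 1."""
    rms = np.sqrt((calib ** 2).mean(axis=0)) + eps
    return rms / rms.mean()

def auto_detect(calib: np.ndarray, threshold: float = 1.5):
    """Enable equalization only when channels are genuinely imbalanced;
    the 1.5 threshold is a placeholder, not the project's value."""
    s = innerq_scales(calib)
    ratio = float(s.max() / s.min())
    return (s if ratio > threshold else np.ones_like(s)), ratio

# Usage: divide each token vector by the scales before rotation and
# quantization, multiply back after dequant; scales are stored once.
rng = np.random.default_rng(1)
calib = rng.standard_normal((4096, 128)) * np.linspace(0.5, 2.0, 128)
scales, ratio = auto_detect(calib)
token = rng.standard_normal(128)
equalized = token / scales     # quantize this instead of the raw token
restored = equalized * scales  # applied after dequantization
```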