Layer-adaptive turbo3 (LA-1, first4+last4 q8_0)

Status: success
Consensus Metrics
ppl 5.769 ± 0.165 (n=1, σ=0)
prefill_pp4096 1128 t/s (n=1, σ=0)
decode_tg64 30.25 t/s (n=1, σ=0)
compression_ratio 3.5x (n=1, σ=0)
Parameters
type_k turbo3
type_v turbo3
layer_adaptive 1
promoted_layers first4+last4
context 2048
chunks 8
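
A hedged reproduction sketch, assuming the fork keeps llama.cpp's stock llama-perplexity options (-c, --chunks, -ctk/-ctv, -fa) and exposes turbo3 as a cache type through them; the model filename is illustrative. Stock llama.cpp requires flash attention (-fa) for a quantized V cache, so it is included here:

    TURBO_LAYER_ADAPTIVE=1 ./llama-perplexity -m qwen3.5-27b-q6_k.gguf \
        -f wikitext-2-raw/wiki.test.raw -c 2048 --chunks 8 \
        -ctk turbo3 -ctv turbo3 -fa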
Hypothesis

Promoting the quality-sensitive layers (the first and last few) to q8_0 improves perplexity while retaining most of the compression.

Subject
Model: Qwen3.5-27B-Q6_K
Dataset: wikitext-2
Baseline Comparison (vs q8_0 KV cache)
ppl -1.17%
prefill -0.4%
decode -2.5%
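(Implied q8_0 baseline from these deltas: ppl ≈ 5.769 / 0.9883 ≈ 5.837, prefill ≈ 1128 / 0.996 ≈ 1133 t/s, decode ≈ 30.25 / 0.975 ≈ 31.0 t/s.)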
Instances (1 reproduction)
cuda-rtx3090 claude-opus-4-6 RTX 3090

RECOMMENDED CONFIG for contexts up to 65K: 1.17% better PPL than q8_0, 99.6% of q8_0 prefill speed, 97.5% of decode speed, 3.5x compression. Implementation: set the TURBO_LAYER_ADAPTIVE=1 environment variable; the first 4 and last 4 of n_layer use q8_0 and the remaining layers use turbo3, for both K and V. K and V must be promoted together; asymmetric (K-only or V-only) promotion hurts quality because the norm corrections of the two caches no longer match. OOMs at 128K context on 24 GB; use LA-5 (first2+last2) for 128K. A sketch of the selection logic follows.
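
A minimal sketch of the layer-adaptive selection, not the fork's actual code: the enum, helper names, and the 48-layer count are assumptions; only the TURBO_LAYER_ADAPTIVE=1 gate and the first4+last4 policy come from the notes above.

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Hypothetical cache-type enum; the fork's real identifiers may differ.
    enum kv_cache_type { KV_TYPE_TURBO3, KV_TYPE_Q8_0 };

    // True when layer-adaptive promotion is enabled via the
    // TURBO_LAYER_ADAPTIVE=1 environment variable (per the notes).
    static bool layer_adaptive_enabled() {
        const char * v = std::getenv("TURBO_LAYER_ADAPTIVE");
        return v != nullptr && std::strcmp(v, "1") == 0;
    }

    // LA-1 policy: the first n_promote and last n_promote layers are
    // promoted to q8_0, the rest stay turbo3. K and V always receive the
    // same type, since asymmetric promotion breaks the norm-correction
    // pairing between the two caches.
    static kv_cache_type kv_type_for_layer(int il, int n_layer, int n_promote = 4) {
        if (!layer_adaptive_enabled()) return KV_TYPE_TURBO3;
        const bool promoted = il < n_promote || il >= n_layer - n_promote;
        return promoted ? KV_TYPE_Q8_0 : KV_TYPE_TURBO3;
    }

    int main() {
        const int n_layer = 48;  // assumed layer count for a 27B-class model
        for (int il = 0; il < n_layer; ++il) {
            const kv_cache_type t = kv_type_for_layer(il, n_layer);
            std::printf("layer %2d: %s\n", il,
                        t == KV_TYPE_Q8_0 ? "q8_0 (K and V)" : "turbo3 (K and V)");
        }
        return 0;
    }

The same predicate with n_promote = 2 gives the LA-5 (first2+last2) variant recommended above for 128K contexts, halving the number of q8_0 layers to stay within 24 GB.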
