Bulk dequant turbo KV to fp16 then use MMA tensor core kernel for prefill
1.78x prefill speedup; turbo3 now reaches 98.8% of q8_0 prefill throughput. Implementation: when Q->ne[1] > 1 (prefill), allocate temporary fp16 buffers via cudaMalloc (NOT cudaMallocAsync, which causes NaN on CUDA graph replay), bulk-dequantize turbo K/V to fp16, then dispatch the standard MMA tensor core kernel. During decode (Q->ne[1] == 1), keep the existing vec kernel with inline dequant. Memory overhead is ~16MB per head group, temporary and freed after attention. IMPORTANT: enabled only for turbo3, NOT turbo4. The QJL correction loses ~1% PPL through the fp16 round-trip, because the 10-bit mantissa rounds away QJL adjustments of ~0.001 magnitude. turbo4 prefill was later enabled separately, with the quality tradeoff explicitly accepted (EXP-0016b).
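The prefill/decode dispatch above can be sketched as plain branching logic. This is an illustrative sketch, not the actual kernel-selection code; `choose_attention_path`, the string path names, and the `kv_type` tag are all hypothetical:

```python
def choose_attention_path(q_seq_len: int, kv_type: str) -> str:
    """Pick an attention kernel path (illustrative sketch, not real code).

    q_seq_len -- Q->ne[1], the number of query tokens in this call
    kv_type   -- hypothetical tag for the KV cache quantization format
    """
    if q_seq_len > 1 and kv_type == "turbo3":
        # Prefill: bulk-dequant K/V to temporary fp16 buffers, then
        # dispatch the standard MMA tensor core kernel.
        return "bulk_dequant_fp16_mma"
    # Decode (q_seq_len == 1), or turbo4, where the fp16 round-trip
    # would round away the QJL correction: vec kernel with inline dequant.
    return "vec_inline_dequant"
```

Note that turbo4 falls through to the vec path even during prefill, matching the quality constraint described above.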
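The fp16 round-trip loss cited for turbo4 can be reproduced with Python's stdlib half-float packing: with a 10-bit mantissa, the fp16 ulp near 8.0 is 2**-7 = 0.0078125, so an adjustment of ~0.001 magnitude vanishes entirely. The specific values (8.0, 0.001) are illustrative magnitudes, not taken from the actual QJL code:

```python
import struct

def fp16_roundtrip(x: float) -> float:
    """Round a float to the nearest IEEE 754 half (fp16) and back."""
    return struct.unpack('e', struct.pack('e', x))[0]

value = 8.0         # illustrative dequantized K/V magnitude
adjustment = 0.001  # illustrative QJL-scale correction

# The fp16 ulp at 8.0 is 2**-7 = 0.0078125, several times larger than
# the adjustment, so the corrected value rounds back to the original:
survived = fp16_roundtrip(value + adjustment) != value
print("QJL adjustment survives fp16 round-trip:", survived)
```

This is why the fp16 bulk-dequant path is safe for turbo3 but degrades turbo4: turbo3 has no sub-ulp correction term to lose.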