turbo3 enables full 128K context on a 24GB GPU where q8_0 OOMs
- 27B model at full 128K context on 24GB: impossible with q8_0, which OOMs at ~65K.
- Decode speed holds constant at ~30 tok/s across all context lengths (4K through 128K); prefill scales sub-linearly.
- Context-length recommendations:
  - up to 65K: LA-1 (best PPL, -1.17%)
  - 65K-128K: LA-5 or uniform turbo3 (LA-1 OOMs)
  - 128K+: uniform turbo3 (4.9x compression)
- KV cache memory: turbo3 uses ~3.5 bits/element vs q8_0's 8 bits, i.e. ~2.3x less VRAM for the KV cache alone.
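To make the VRAM arithmetic concrete, here is a minimal sizing sketch. The layer count, KV-head count, and head dimension below are illustrative assumptions for a 27B-class model (they are not stated in these notes), but the bits-per-element figures (8 for q8_0, ~3.5 for turbo3) come straight from the numbers above, and the ~2.3x ratio falls out of 8 / 3.5:

```python
# Hypothetical KV-cache sizing sketch. N_LAYERS, N_KV_HEADS, and
# HEAD_DIM are assumed values for a 27B-class model, chosen only to
# illustrate the scaling; substitute the real architecture's numbers.
N_LAYERS = 46
N_KV_HEADS = 16
HEAD_DIM = 128

def kv_cache_gib(ctx_len: int, bits_per_elem: float) -> float:
    """KV-cache size in GiB: 2x (K and V) per layer, per KV head."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_len
    return elems * bits_per_elem / 8 / 2**30

for ctx in (4_096, 65_536, 131_072):
    q8 = kv_cache_gib(ctx, 8.0)     # q8_0: 8 bits/element
    t3 = kv_cache_gib(ctx, 3.5)     # turbo3: ~3.5 bits/element
    print(f"{ctx:>7} ctx: q8_0 {q8:5.1f} GiB  turbo3 {t3:5.1f} GiB  "
          f"({q8 / t3:.2f}x smaller)")
```

Under these assumed dimensions the q8_0 KV cache alone passes ~11 GiB around 65K context, which, stacked on top of the 27B weights, is consistent with the ~65K OOM point noted above; the compression ratio itself (8 / 3.5 ≈ 2.29x) is independent of the architecture assumptions.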