turbo dequant-to-fp16 kernels ignore stream dimension ne[3], causing catastrophic PPL with n_seq > 1
Critical bug: the turbo dequant-to-fp16 kernels in fattn.cu ignored ne[3], the stream dimension. With kv_unified=false (the default) and n_seq > 1, the K/V tensors have ne[3] = n_stream during prefill. Only stream 0 was allocated and dequantized, so streams 1 and above read uninitialized fp16 garbage.

Fix: add ne[3]/nb[3] to the kernel signatures, the allocation sizes, and the grid launches (now 3D) for all turbo dequant kernels (turbo3, turbo4) in both the prefill and decode paths.

PPL after the fix: n_seq=1 unchanged; n_seq=2 improved from 17.10 to 6.30; n_seq=4 improved from 22.56 to 6.34.
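
A minimal host-side C++ sketch of the sizing and indexing issue described above. This is not the kernel code from fattn.cu: `Tensor4`, `fp16_elems_buggy`, `fp16_elems_fixed`, `src_row_offset`, and `dst_row_offset` are hypothetical names, and the only thing assumed from ggml is the convention that ne[] holds element counts and nb[] holds byte strides. In a CUDA launch, the stream index i3 would typically come from a third grid dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the shape/stride fields passed to the kernels.
struct Tensor4 {
    int64_t ne[4]; // number of elements per dimension (ne[3] = streams)
    size_t  nb[4]; // stride in bytes per dimension
};

// Buggy sizing: ne[3] is dropped, so the fp16 buffer covers stream 0 only.
inline int64_t fp16_elems_buggy(const Tensor4 & t) {
    return t.ne[0] * t.ne[1] * t.ne[2];
}

// Fixed sizing: every stream gets its own slice of the fp16 buffer.
inline int64_t fp16_elems_fixed(const Tensor4 & t) {
    return t.ne[0] * t.ne[1] * t.ne[2] * t.ne[3];
}

// Byte offset of row (i1, i2, i3) in the quantized source tensor.
// The nb[3] term is what the original kernel signatures were missing.
inline size_t src_row_offset(const Tensor4 & t, int64_t i1, int64_t i2, int64_t i3) {
    assert(i1 < t.ne[1] && i2 < t.ne[2] && i3 < t.ne[3]);
    return (size_t) i1 * t.nb[1] + (size_t) i2 * t.nb[2] + (size_t) i3 * t.nb[3];
}

// Element offset of the same row in a contiguous fp16 destination buffer.
inline int64_t dst_row_offset(const Tensor4 & t, int64_t i1, int64_t i2, int64_t i3) {
    return ((i3 * t.ne[2] + i2) * t.ne[1] + i1) * t.ne[0];
}
```

With ne[3] = 2, the first row of stream 1 starts at exactly `fp16_elems_buggy(t)`, one element past the end of a buffer sized without ne[3]. Any read of that stream therefore falls outside the region that was allocated and dequantized.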
Quality comparison across context lengths is noisy: the error bars (±0.16 to ±0.18) are larger than the measured differences (0.03 to 0.09 PPL). Within that noise, turbo3 is generally competitive with q8_0 at all context lengths. The PPL increase from 2K to 8K is a data effect (later wikitext text is harder to predict), not quantization degradation: turbo3 and q8_0 show the same pattern. turbo3 uniform is slightly better than LA-1 at 8K (7.3783 vs 7.3952), but both values are within error bars.
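
The "within error bars" reasoning can be checked with a small C++ helper. This is an illustrative sketch, not part of the evaluation tooling: `within_error` is a hypothetical name, and it assumes the two error bars are independent. That assumption is likely conservative here, because both runs score the same text.

```cpp
#include <cmath>

// Returns true when |a - b| is smaller than the combined uncertainty.
// Assumption: the two error bars are independent, so they add in quadrature.
inline bool within_error(double a, double err_a, double b, double err_b) {
    const double combined = std::sqrt(err_a * err_a + err_b * err_b);
    return std::fabs(a - b) < combined;
}
```

For the 8K comparison, the difference is 7.3952 - 7.3783 = 0.0169 PPL. Even with the smaller reported error bar of 0.16 on each side, the combined uncertainty is about 0.23, so `within_error(7.3783, 0.16, 7.3952, 0.16)` holds and the gap is not distinguishable from noise.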