Replacing the fixed TBQ_CHUNK=4096 with a cudaMemGetInfo-based calculation enables better VRAM utilization.
With 14 GB free, adaptive sizing picks chunk=8192. The chunked path is slightly more accurate than the MMA path (FP32 vs. FP16 accumulation), and PPL matches the MMA path within 0.05 at all contexts tested.