Fused kernel D=256 performance benchmark

TODO-009 (status: proposed, priority: medium)
Description

EXP-0006 verified that the fused kernel is correct at D=256 but did not measure its throughput. The two-butterfly approach used for D=256 may perform differently from D=128 because of its doubled shared memory usage and its register pressure, so a dedicated benchmark is needed.
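To make the shared memory concern concrete, here is a rough back-of-the-envelope sketch. It assumes that a tile staged in shared memory holds a fixed number of rows, each `head_dim` elements wide, at 2 bytes per element. The tile height (`ROWS`) and the element size are illustrative assumptions, not values taken from EXP-0006.

```python
def tile_bytes(rows: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to stage one rows x head_dim tile in shared memory."""
    return rows * head_dim * bytes_per_elem


# Hypothetical tile height; the real kernel's tiling is not stated in this TODO.
ROWS = 64

for d in (128, 256):
    print(f"head_dim={d}: {tile_bytes(ROWS, d)} bytes per tile")
```

Under these assumptions the footprint scales linearly with `head_dim`, so D=256 needs twice the shared memory per tile that D=128 does. That can reduce the number of thread blocks resident per SM, which is one reason throughput may not carry over from D=128.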

Reference

EXP-0006

Suggested Parameters
head_dim 256
model gemma-3-12b
contexts [2048
approach fused_dequant_attention
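
A minimal timing harness for the parameters above might look like the following sketch. The names `measure_throughput` and `stand_in_kernel` are hypothetical, and the stand-in does CPU work only so the example runs anywhere. A real run would call the fused kernel instead and synchronize the device before each timer read. Only the 2048 context length is used because the rest of the `contexts` list is not visible here.

```python
import time
from typing import Callable, Dict, Iterable


def measure_throughput(
    run: Callable[[int], None],
    contexts: Iterable[int],
    warmup: int = 2,
    iters: int = 5,
) -> Dict[int, float]:
    """Return tokens per second for each context length."""
    results: Dict[int, float] = {}
    for n_tokens in contexts:
        # Warm-up runs are excluded from timing (caches, lazy initialization).
        for _ in range(warmup):
            run(n_tokens)
        start = time.perf_counter()
        for _ in range(iters):
            run(n_tokens)
        # For a GPU kernel, synchronize the device here before reading the clock.
        elapsed = max(time.perf_counter() - start, 1e-12)
        results[n_tokens] = n_tokens * iters / elapsed
    return results


def stand_in_kernel(n_tokens: int) -> None:
    """Placeholder workload; replace with the fused D=256 kernel launch."""
    sum(i * i for i in range(n_tokens))


throughput = measure_throughput(stand_in_kernel, contexts=[2048])
for n, tps in throughput.items():
    print(f"context={n}: {tps:.0f} tokens/s")
```

Running the same harness at head_dim 128 and 256 with identical context lengths would give the like-for-like comparison that EXP-0006 did not provide.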
Provenance
Proposed by @dusterbloom via adaptive-chunked-prefill claude-opus-4-6