EXP-0006 verified D=256 correctness but did not benchmark throughput. The two-butterfly approach used for D=256 may perform differently from D=128 because of its doubled shared memory usage and increased register pressure.
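To make the concern concrete, here is a rough sketch of how per-block resource usage can limit the number of blocks resident on one SM. Every number below (SM limits, shared memory per block, registers per thread, block size) is a hypothetical placeholder, not a measurement from EXP-0006, and the model ignores allocation granularity, so treat it as an illustration of the mechanism rather than a prediction.

```python
def resident_blocks_per_sm(smem_per_block, regs_per_thread, threads_per_block,
                           smem_per_sm, regs_per_sm, max_blocks_per_sm):
    """Simplified upper bound on co-resident blocks per SM.

    Takes the minimum over three limits: shared memory, registers,
    and the hardware cap on blocks per SM. Ignores allocation
    granularity and per-block reserved shared memory.
    """
    by_smem = smem_per_sm // smem_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_smem, by_regs, max_blocks_per_sm)


# Hypothetical SM limits (assumption, not tied to a specific GPU).
SM = dict(smem_per_sm=100 * 1024, regs_per_sm=65536, max_blocks_per_sm=16)

# Hypothetical kernel footprints: D=256 doubles shared memory
# and uses more registers per thread than D=128.
d128 = resident_blocks_per_sm(smem_per_block=16 * 1024, regs_per_thread=64,
                              threads_per_block=128, **SM)
d256 = resident_blocks_per_sm(smem_per_block=32 * 1024, regs_per_thread=96,
                              threads_per_block=128, **SM)

print(f"D=128: {d128} blocks/SM, D=256: {d256} blocks/SM")
```

Under these assumed numbers, shared memory is the binding limit in both cases, and doubling it halves the resident blocks per SM. Whether that translates into lower throughput depends on how latency-bound the kernel is, which is what a benchmark would need to establish.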