Flipping the computation to iterate over centroids and accumulate the matching Q elements avoids the per-element lookup.
WORST of all 14 approaches. 4 centroids × 4 elements × 4 comparisons = 64 float comparisons per dequant call. Each comparison likely compiles to a branch on Apple8, which would more than cancel the lookups saved.