OCC SIMD HYPOTHESIS V1

Baseline scalar: EPYC4344P p50 20ns p95 30ns p99 30ns

Observation:

manual unroll reduces tail latency

Hypothesis:

AVX2 compare+movemask+popcnt may reduce p50

candidate:

_loadu_si256 _cmpeq_epi8 _movemask_epi8 _popcnt

goal:

p50 < 20ns