OCC BREAKPOINT ANALYSIS V1
Date: 2026-05-22
Machine: AMD EPYC 4344P
Benchmark: OCC_STEP_BENCH_V1
Dataset: mini BWT
Iterations: 100000
Measured result:
scan_len (bytes)   scalar p50   simd_avx2 p50
16                 20ns         20ns
32                 20ns         20ns
64                 20ns         20ns
128                30ns         20ns
256                30ns         20ns
512                40ns         20ns
1024               50ns         30ns
p95 (reported for scan_len 64 only): scalar 30ns, simd_avx2 20ns
Core finding:
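AVX2 byte-compare Occ starts to pull ahead between scan_len 64 and 128 bytes: at 64 only p95 differs (scalar 30ns vs AVX2 20ns), and from 128 upward p50 differs as well (scalar 30ns vs AVX2 20ns).
For scan_len <= 32 bytes, scalar and AVX2 are tied at p50 (20ns each).
All reported values are multiples of 10ns, so gaps smaller than that may not be visible in this data.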
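The breakeven can be read directly off the measured table. A minimal sketch, using only the p50 values reported above; the names `P50_NS`, `first_p50_gap` and `speedups` are illustrative and not part of OCC_STEP_BENCH_V1:

```python
# p50 latencies in ns, copied from the measured result:
# scan_len -> (scalar, simd_avx2)
P50_NS = {
    16: (20, 20),
    32: (20, 20),
    64: (20, 20),
    128: (30, 20),
    256: (30, 20),
    512: (40, 20),
    1024: (50, 30),
}


def first_p50_gap(table):
    """Smallest scan_len at which scalar p50 exceeds AVX2 p50, or None."""
    for scan_len in sorted(table):
        scalar, avx2 = table[scan_len]
        if scalar > avx2:
            return scan_len
    return None


def speedups(table):
    """scalar / AVX2 p50 ratio per scan_len."""
    return {n: scalar / avx2 for n, (scalar, avx2) in table.items()}


print(first_p50_gap(P50_NS))   # 128
print(speedups(P50_NS)[512])   # 2.0
```

At scan_len 64 the p50 values are still equal; the gap at that length shows up only in p95.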
Interpretation:
The current GLYPH mini FM layout has avg_scan_bytes of about 28.
Therefore, the current checkpoint density already keeps Occ scans below the AVX2 breakeven range of 64-128 bytes.
This explains why AVX2 did not significantly improve p50 in the real Occ benchmark.
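To make the link between checkpoint density and scan length concrete, here is a minimal sketch of a checkpointed Occ with a forward byte scan. It assumes one checkpoint array per symbol and a scan that always starts at the preceding checkpoint; the actual GLYPH layout may differ, and `build_checkpoints` and `occ` are hypothetical names.

```python
from typing import List


def build_checkpoints(bwt: bytes, symbol: int, step: int) -> List[int]:
    """checkpoints[k] = number of `symbol` bytes in bwt[:k * step]."""
    checkpoints = [0]
    count = 0
    for i, b in enumerate(bwt, start=1):
        if b == symbol:
            count += 1
        if i % step == 0:
            checkpoints.append(count)
    return checkpoints


def occ(bwt: bytes, checkpoints: List[int], symbol: int, step: int, pos: int) -> int:
    """Occ(symbol, pos): occurrences of `symbol` in bwt[:pos]."""
    block = pos // step
    start = block * step
    # The residual scan covers pos - start bytes, which is always < step.
    return checkpoints[block] + bwt.count(symbol, start, pos)


bwt = b"annb$aa"
cps = build_checkpoints(bwt, ord("a"), step=2)
print(occ(bwt, cps, ord("a"), 2, 7))  # 3
```

In this sketch the scan never exceeds checkpoint_step - 1 bytes, so a dense checkpoint table keeps every query in the short-scan regime where scalar and AVX2 were measured as tied.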
Important conclusion:
SIMD was not useless.
SIMD revealed that the current checkpoint layout is already latency-oriented.
Layout implication:
checkpoint_step 32:
- latency profile
- expected scan below 64 bytes
- SIMD optional
checkpoint_step 128:
- balanced profile
- SIMD begins to matter
checkpoint_step 256+:
- compact / memory-saving profile
- SIMD recommended or required
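The three tiers above could be written down as data. The following is a sketch only: the `LayoutProfile` type and `profile_for` helper are hypothetical, and the rule for steps that fall between the listed values (round up to the next listed tier) is an assumption, not something the measurements establish.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutProfile:
    name: str
    checkpoint_step: int  # representative step from the list above
    simd: str             # SIMD guidance from the list above


PROFILES = (
    LayoutProfile("latency", 32, "optional"),
    LayoutProfile("balanced", 128, "begins to matter"),
    LayoutProfile("compact", 256, "recommended or required"),
)


def profile_for(checkpoint_step: int) -> LayoutProfile:
    """Map a step to a tier.

    Assumption: in-between steps round up to the next listed tier,
    and anything at or above 256 is treated as compact.
    """
    for profile in PROFILES:
        if checkpoint_step <= profile.checkpoint_step:
            return profile
    return PROFILES[-1]


print(profile_for(32).name)    # latency
print(profile_for(512).name)   # compact
```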
Future direction:
Do not jump directly to AVX512.
Next real experiment:
measure checkpoint_step vs:
- FM size
- avg_scan_bytes
- scalar latency
- AVX2 latency
- memory footprint
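A sweep of this kind might be prototyped as below. This is a sketch under stated assumptions: `sweep` is a hypothetical helper, it uses one checkpoint array for a single symbol, it reports the number of checkpoint entries as a stand-in for size, and it times only a scalar scan from Python. Interpreter overhead dominates those timings, so they are not comparable to the nanosecond figures above; the AVX2 column would have to come from the native benchmark.

```python
import random
import time


def sweep(bwt, symbol, steps, queries=1000, seed=0):
    """For each checkpoint_step, report table size, mean scan length, scalar time."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(bwt) + 1) for _ in range(queries)]
    rows = []
    for step in steps:
        # checkpoints[k] = count of `symbol` in bwt[:k * step]
        checkpoints = [0]
        for start in range(0, len(bwt) - step + 1, step):
            block_count = bwt.count(symbol, start, start + step)
            checkpoints.append(checkpoints[-1] + block_count)
        scanned = 0
        t0 = time.perf_counter_ns()
        for pos in positions:
            block = pos // step
            start = block * step
            _ = checkpoints[block] + bwt.count(symbol, start, pos)
            scanned += pos - start
        elapsed_ns = time.perf_counter_ns() - t0
        rows.append({
            "checkpoint_step": step,
            "checkpoint_entries": len(checkpoints),
            "avg_scan_bytes": scanned / queries,
            "scalar_ns_per_query": elapsed_ns / queries,
        })
    return rows


data_rng = random.Random(1)
sample_bwt = bytes(data_rng.choice(b"ACGT$") for _ in range(4096))
for row in sweep(sample_bwt, ord("A"), [32, 128, 256]):
    print(row)
```

The synthetic `sample_bwt` is random filler, not the mini BWT dataset; it only exercises the harness.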
Strategic meaning:
checkpoint_step is not only a builder parameter.
checkpoint_step is part of a future Layout Contract.
Possible future layout profiles:
- latency
- balanced
- compact
Byte-layout ceiling:
Current SIMD approach uses byte comparison:
BWT[i] == symbol
A truly SIMD-native FM layout may require a bit-plane / strided representation.
The bit-plane layout is deferred for now.
First exhaust byte-layout behavior.
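For orientation, the difference between the two representations can be sketched as follows. This is an illustration only, not a proposed GLYPH layout: it assumes symbols are already mapped to small integer codes, uses Python integers as stand-ins for bit vectors, and the names `byte_occ`, `build_bit_planes` and `bitplane_occ` are hypothetical.

```python
def byte_occ(codes: bytes, code: int, pos: int) -> int:
    """Byte layout: compare every element in codes[:pos] against `code`."""
    return sum(1 for c in codes[:pos] if c == code)


def build_bit_planes(codes: bytes, bits: int):
    """planes[k] has bit i set iff bit k of codes[i] is 1."""
    planes = [0] * bits
    for i, c in enumerate(codes):
        for k in range(bits):
            if (c >> k) & 1:
                planes[k] |= 1 << i
    return planes


def bitplane_occ(planes, code: int, pos: int) -> int:
    """Bit-plane layout: AND the planes (or their complements), then popcount."""
    match = (1 << pos) - 1  # restrict to positions [0, pos)
    for k, plane in enumerate(planes):
        match &= plane if (code >> k) & 1 else ~plane
    return bin(match).count("1")


codes = bytes([0, 1, 2, 3, 1, 1, 0, 2])  # 2-bit codes, e.g. a 4-symbol alphabet
planes = build_bit_planes(codes, bits=2)
print(byte_occ(codes, 1, 6), bitplane_occ(planes, 1, 6))  # 3 3
```

The byte layout does one comparison per position, while the bit-plane form does one AND per plane followed by a population count over the whole range.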