Sentinel invariant
GLYPH FM-index v0.x is built over:
raw_corpus + real appended 0x00 sentinel
The sentinel is a real byte appended to the indexed corpus before suffix array, BWT, and FM-index construction.
This invariant is required for byte-exact FM-index correctness.
Current v0.x constraint
Input corpora must not contain 0x00.
The 0x00 byte is reserved as the unique terminal sentinel for the indexed corpus.
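As an illustration of the constraint, a pre-build check might look like the sketch below. The function name prepare_corpus is hypothetical and is not part of GLYPH; the sketch only shows the rule stated above: reject any input containing 0x00, then append exactly one 0x00 as the terminal sentinel.

```python
SENTINEL = b"\x00"  # reserved terminal byte, per the v0.x constraint


def prepare_corpus(raw: bytes) -> bytes:
    """Return the byte sequence to index: raw corpus plus one real sentinel.

    Hypothetical helper for illustration only; not the canonical builder.
    """
    offset = raw.find(SENTINEL)
    if offset != -1:
        # The sentinel must be unique, so an embedded 0x00 is an input error.
        raise ValueError(f"corpus contains 0x00 at byte offset {offset}")
    return raw + SENTINEL
```

With this sketch, prepare_corpus(b"abc") returns b"abc\x00", and an input such as b"a\x00b" is rejected before any index structure is built.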
Why this matters
The suffix array, BWT, and FM-index must describe the same byte sequence.
Using a synthetic sentinel during BWT construction without appending it to the corpus can create inconsistent FM intervals and undercount matches.
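The toy sketch below shows what "the same byte sequence" means in practice. It is not GLYPH's implementation: the function names are hypothetical, the suffix array is built by naive sorting, and rank queries are plain scans. The point is only that the suffix array, the BWT, and the C table are all derived from one sequence, raw_corpus plus the appended 0x00, and that backward search over them agrees with a direct scan.

```python
SENTINEL = 0x00


def build_fm(raw: bytes):
    """Build (bwt, C) over raw + sentinel. Naive construction, small inputs only."""
    if SENTINEL in raw:
        raise ValueError("input corpus must not contain 0x00")
    text = raw + bytes([SENTINEL])  # the sentinel is a real byte of the text
    # Suffix array over the full text, sentinel included.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT from the same text; for i == 0, text[-1] is the sentinel.
    bwt = bytes(text[i - 1] for i in sa)
    # C[c] = number of bytes in text strictly smaller than c.
    counts = [0] * 256
    for b in text:
        counts[b] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c] = total
        total += counts[c]
    return bwt, C


def fm_count(bwt: bytes, C, pattern: bytes) -> int:
    """Count occurrences of pattern using backward search."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        lo = C[c] + bwt.count(c, 0, lo)  # rank of c in bwt[:lo]
        hi = C[c] + bwt.count(c, 0, hi)  # rank of c in bwt[:hi]
        if lo >= hi:
            return 0
    return hi - lo


def naive_count(raw: bytes, pattern: bytes) -> int:
    """Reference count of (possibly overlapping) occurrences by direct scan."""
    return sum(raw.startswith(pattern, i) for i in range(len(raw) - len(pattern) + 1))


bwt, C = build_fm(b"abracadabra")
print(fm_count(bwt, C, b"abra"), naive_count(b"abracadabra", b"abra"))
```

Comparing fm_count against naive_count on a small corpus is a cheap way to detect an index whose components were built over different sequences.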
Canonical builder
Use:
tools/build_glyph_index_v1.sh
Do not bypass the canonical builder for v0.x indexes.
Future direction
A future index format may support arbitrary 0x00 bytes in input corpora by using a 257-symbol alphabet or an explicit out-of-band sentinel representation.
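One way a 257-symbol alphabet could work is sketched below. This is an assumption for illustration, not a description of any planned GLYPH format: each input byte b is mapped to the symbol b + 1, which frees symbol 0 to act as a sentinel that cannot collide with corpus data, including 0x00 bytes. The names encode_257 and decode_257 are hypothetical.

```python
from typing import List

SENTINEL_SYMBOL = 0  # hypothetical: smallest symbol, never produced by a data byte


def encode_257(raw: bytes) -> List[int]:
    """Map bytes 0..255 to symbols 1..256 and append the sentinel symbol 0."""
    symbols = [b + 1 for b in raw]
    symbols.append(SENTINEL_SYMBOL)
    return symbols


def decode_257(symbols: List[int]) -> bytes:
    """Invert encode_257; the sequence must end with the sentinel symbol."""
    if not symbols or symbols[-1] != SENTINEL_SYMBOL:
        raise ValueError("symbol sequence must end with the sentinel symbol")
    return bytes(s - 1 for s in symbols[:-1])
```

Under this mapping a corpus such as b"\x00a" encodes to [1, 98, 0], so an embedded 0x00 stays distinct from the terminal sentinel. The cost is that symbols no longer fit in a single byte, which is one reason such a change would need a new index format.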