GLYPH Engine Overview
GLYPH is a deterministic byte-exact retrieval engine for static corpora.
It is not a search relevance engine, vector database, fuzzy matcher, or regex engine.
Core idea:
- build an offline FM-index over a prepared corpus
- answer repeated exact byte queries without rescanning the corpus
- preserve deterministic retrieval semantics
Current architecture:
- raw corpus
- sentinel-safe prepared corpus
- suffix array
- BWT
- FM-index
- persistent query backend
- segmented manifest layer
Canonical build invariant:
GLYPH FM-index v0.x must index:
corpus + real appended 0x00 sentinel
The canonical build flow is:
raw corpus
-> prepare_sentinel_corpus_v1.py
-> build_sa_u32
-> build_bwt
-> build_fm
Segmented retrieval:
GLYPH v0.2 introduces shard manifests. Each shard has an independent corpus, SA, BWT, and FM index. Queries are dispatched across shards and merged deterministically.
Known v0.x limitation:
The current sentinel-safe mode requires input corpora without 0x00 bytes. Arbitrary raw bytes require a future 257-symbol alphabet or out-of-band sentinel representation.
Correctness status:
The HDFS undercount bug was traced to missing real appended sentinel semantics. After sentinel-safe indexing, segmented retrieval matches Python byte-count ground truth.