GLYPH Release State — v0.2
Status:
- experimental
- deterministic retrieval prototype
- segmented retrieval validated
Validated capabilities#
The following paths were validated on real corpora:
- suffix-array construction
- BWT construction
- FM-index construction
- persistent exact retrieval
- segmented shard querying
- deterministic count merging
Validated datasets include:
- HDFS logs
- synthetic mini corpora
Canonical invariant#
GLYPH FM-index v0.x requires:
indexed_corpus = raw_corpus + appended real 0x00 sentinel
Canonical pipeline:
raw corpus
-> prepare_sentinel_corpus_v1.py
-> build_sa_u32
-> build_bwt
-> build_fm
Current guarantees#
GLYPH currently guarantees:
- deterministic exact byte retrieval
- exact suffix-array interval semantics
- deterministic shard merge behavior
- byte-exact counting
Current limitations#
Current v0.x limitations:
- no incremental indexing
- immutable corpora assumption
- no fuzzy matching
- no ranking
- no regex engine
- no semantic search
- no arbitrary 0x00 corpora support
Segmented retrieval status#
Segmented retrieval correctness was validated after fixing sentinel semantics.
Validated result:
- shard-local FM counts match Python ground truth
- merged counts match global corpus truth
Performance model#
GLYPH trades:
- offline indexing cost
- RAM usage
- large index artifacts
for:
- extremely fast repeated exact retrieval
GLYPH is optimized for:
- repeated deterministic queries
- static corpora
- forensic retrieval
- infrastructure-scale exact lookup
Non-goals#
GLYPH is currently NOT intended to be:
- Elasticsearch replacement
- vector database
- semantic retrieval engine
- approximate nearest-neighbor system
- ranking engine
Stability#
Binary formats and APIs may still evolve during v0.x development.
Backward compatibility is not yet guaranteed.