GLYPH Roadmap

v0.1 — Current#

Current state:

exact byte-level retrieval
SA32u
BWT
FM-index
persistent FM query backend
1GB / 4GB validated corpora
HDFS benchmark published

Known tradeoffs:

very high RAM overhead
static corpora only
no regex / ranking / fuzzy search
no segmented retrieval yet

v0.2 — Segmented Retrieval#

Goal:

Break the 4GB single-shard ceiling.

Planned work:

segmented corpus layout
shard routing
merged shortlist retrieval
shard-local SA/BWT/FM
deterministic cross-shard query merge

Target:

50GB+ static corpora
bounded RAM per shard

v0.3 — RAM Reduction#

Goal:

Reduce persistent memory overhead.

Research directions:

mmap-based access
compressed FM structures
sampled SA
lazy loading
MADV_RANDOM / paging experiments

Target:

lower RAM amplification
larger corpora on commodity hardware

Non-goals#

GLYPH is not:

a search relevance engine
a vector database
a replacement for Elasticsearch
a fuzzy matcher
a regex engine

GLYPH explores a different tradeoff:

persistent byte-level indexed retrieval over static corpora.