GLYPH Scaling Limits

Current state (v0.x)

GLYPH v0.x uses SA32: suffix array stored as uint32.

Hard format limit:

corpus ≤ 4,294,967,295 bytes (~4 GiB)

Above this: SA32 overflow → silent corruption. No overflow detection exists yet. This is a known gap.
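A build-time guard for this gap could look roughly like the sketch below. It is an illustration only: `check_sa32_capacity` and `CorpusTooLargeError` are hypothetical names, not part of GLYPH, and the limit is the figure quoted above.

```python
# Largest corpus size the SA32 format can address, as stated above.
SA32_MAX_CORPUS_BYTES = 4_294_967_295


class CorpusTooLargeError(ValueError):
    """Hypothetical error type: corpus does not fit uint32 suffix positions."""


def check_sa32_capacity(corpus_size: int) -> None:
    """Raise a hard error instead of building a silently corrupt index."""
    if corpus_size > SA32_MAX_CORPUS_BYTES:
        raise CorpusTooLargeError(
            f"corpus is {corpus_size} bytes; SA32 supports at most "
            f"{SA32_MAX_CORPUS_BYTES} bytes"
        )
```

A builder would call this once, before allocating the suffix array, so an oversized corpus fails before any index bytes are written.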

RAM economics

Plain FM-index without compression requires approximately:

SA32:       4 bytes/symbol
BWT:        1 byte/symbol
FM index:   ~4 bytes/symbol (checkpoints + Occ)
corpus:     1 byte/symbol
─────────────────────────────
total:      ~9-10× corpus size in RAM

Observed on benchmark machine (HDFS, 1 GB corpus):

corpus:    1.0 GB
total RAM: ~9.4 GB
ratio:     9.4×

Practical limits on current machine

Benchmark machine: AMD EPYC 4344P, 118 GB RAM available.

corpus  5 GB  →  ~47 GB RAM   comfortable
corpus 10 GB  →  ~94 GB RAM   feasible
corpus 12 GB  →  ~113 GB RAM  near limit
corpus  4 GiB →  SA32 format hard ceiling (rows above need SA64)

On this machine, the binding constraint for any corpus between ~4 GiB and ~12 GB is the SA32 format, not RAM.
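The RAM figures in this section are plain arithmetic on the measured 9.4× ratio. A minimal sketch, with hypothetical helper names and decimal GB assumed throughout:

```python
# Bytes per corpus symbol, copied from the RAM economics breakdown.
BYTES_PER_SYMBOL = {"SA32": 4, "BWT": 1, "FM index": 4, "corpus": 1}

# Ratio observed on the 1 GB HDFS benchmark.
MEASURED_RATIO = 9.4


def estimate_ram_gb(corpus_gb: float, ratio: float = MEASURED_RATIO) -> float:
    """Projected RAM for a corpus, assuming RAM scales linearly with size."""
    return corpus_gb * ratio


def max_corpus_gb(ram_gb: float, ratio: float = MEASURED_RATIO) -> float:
    """Largest corpus that fits in the given RAM at the given ratio."""
    return ram_gb / ratio


if __name__ == "__main__":
    print("theoretical bytes/symbol:", sum(BYTES_PER_SYMBOL.values()))
    for corpus_gb in (5, 10, 12):
        print(f"corpus {corpus_gb} GB -> ~{estimate_ram_gb(corpus_gb):.0f} GB RAM")
    print(f"RAM-bound ceiling at 118 GB: ~{max_corpus_gb(118):.1f} GB corpus")
```

The linear-scaling assumption comes from a single 1 GB measurement, so projections for larger corpora should be treated as estimates until re-measured.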

Scaling ladder

Step 1 — SA32 stable path (current)

works today
corpus limit: ~4 GiB hard
RAM limit: ~12 GB corpus on benchmark machine
status: complete

Step 2 — SA64 path (next)

suffix array stored as uint64
removes 4 GiB hard ceiling
unlocks corpus 4–12 GB on current machine
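RAM ratio: ~9.4× carried over from SA32 as an estimate; uint64 entries add 4 bytes/symbol to the SA, so this needs re-measuring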
requires: builder changes, query binary changes, format versioning
status: designed in SA64_DESIGN.md, not yet implemented

Step 3 — Segmented SA64

multiple SA64 shards
fan-out query across shards
unlocks corpus beyond single machine RAM
matches that span a shard boundary are not detected
status: planned
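The boundary limitation can be shown with a toy fan-out. This is only a sketch: `bytes.find` stands in for a per-shard index query, the shards do not overlap (an assumption), and every function name is hypothetical.

```python
def split_into_shards(corpus: bytes, shard_size: int):
    """Return (global_offset, shard_bytes) pairs; shards do not overlap."""
    return [
        (offset, corpus[offset:offset + shard_size])
        for offset in range(0, len(corpus), shard_size)
    ]


def find_all(haystack: bytes, pattern: bytes):
    """Stand-in for a per-shard index query: all match start offsets."""
    hits = []
    pos = haystack.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(pattern, pos + 1)
    return hits


def fan_out_query(shards, pattern: bytes):
    """Query every shard and translate local offsets to global ones."""
    hits = []
    for base, shard in shards:
        hits.extend(base + local for local in find_all(shard, pattern))
    return sorted(hits)


if __name__ == "__main__":
    corpus = b"abcabcabc"
    shards = split_into_shards(corpus, 4)   # b"abca", b"bcab", b"c"
    print(find_all(corpus, b"abc"))         # matches in the unsharded corpus
    print(fan_out_query(shards, b"abc"))    # matches visible to the fan-out
```

With a shard size of 4, the occurrences starting at offsets 3 and 6 straddle shard boundaries, so the fan-out reports only offset 0.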

Step 4 — Compressed / sampled SA

sampled suffix array (every k-th entry stored)
wavelet tree for Occ table
reduces RAM ratio from ~9.4× to ~2-3×
unlocks corpus 50–100+ GB on reasonable hardware
locate cost increases by O(k) LF steps per reported occurrence
status: research
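The locate trade-off can be sketched on a toy index. The code below is not GLYPH's design; it assumes the common variant that keeps SA entries whose text position is a multiple of k, uses a naive suffix sort, and counts Occ by scanning the BWT instead of using a wavelet tree.

```python
def build_index(text: bytes, k: int):
    """Return (bwt, C, samples, full_sa) for text plus a 0-byte sentinel."""
    assert 0 not in text, "byte 0 is reserved as the sentinel"
    s = text + b"\x00"
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive, toy sizes only
    bwt = bytes(s[i - 1] for i in sa)                # i == 0 wraps to sentinel
    # C[c] = number of symbols in s that are strictly smaller than c.
    counts = [0] * 256
    for c in s:
        counts[c] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c] = total
        total += counts[c]
    # Keep only rows whose text position is a multiple of k.
    samples = {row: pos for row, pos in enumerate(sa) if pos % k == 0}
    return bwt, C, samples, sa  # full sa is returned only for checking


def lf(bwt: bytes, C, row: int) -> int:
    """LF mapping: row of the suffix that starts one position earlier."""
    c = bwt[row]
    return C[c] + bwt[:row].count(c)  # linear-time Occ, for clarity


def locate(bwt: bytes, C, samples, row: int):
    """Recover SA[row] by walking LF until a sampled row is reached."""
    steps = 0
    while row not in samples:
        row = lf(bwt, C, row)
        steps += 1
    return samples[row] + steps, steps


if __name__ == "__main__":
    bwt, C, samples, sa = build_index(b"abracadabra", k=4)
    for row in range(len(sa)):
        pos, steps = locate(bwt, C, samples, row)
        assert pos == sa[row] and steps < 4
    print(f"stored {len(samples)} of {len(sa)} SA entries")
```

Under this sampling scheme each locate needs at most k - 1 LF steps, which is where the O(k) cost per occurrence comes from; a larger k saves memory and slows locate proportionally.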

Why SA64 before compressed SA

Compressed SA is more complex and changes the correctness model. SA64 is a format change with the same algorithm.

SA64 on current machine unlocks:

corpus up to ~12 GB (RAM bound)
multi-shard corpus beyond 4 GiB per shard

Compressed SA becomes necessary only when:

corpus exceeds available RAM / 9.4
or RAM economics become the primary constraint

On a 118 GB machine, that threshold is ~12 GB corpus. Below that, SA64 is sufficient.

Known gaps before SA64

Before implementing SA64:

SA32 overflow detection at build time (hard error, not silent)
index format versioning (magic bytes + version field)
corpus hash in index header (stale index detection)

These must exist in the SA32 path first. SA64 inherits the same format discipline.
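One possible shape for such a header is sketched below. The layout, the magic value `GLYP`, the version number, and the function names are all assumptions made for illustration; the actual GLYPH format may differ.

```python
import hashlib
import struct

MAGIC = b"GLYP"        # hypothetical magic bytes
FORMAT_VERSION = 1     # hypothetical format version

# Little-endian layout: magic (4s), version (H), SA entry width in bytes (B),
# one pad byte (x), corpus length (Q), SHA-256 of the corpus (32s).
HEADER = struct.Struct("<4sHBxQ32s")


def pack_header(corpus: bytes, sa_width: int) -> bytes:
    """Build a header; sa_width is 4 for SA32 or 8 for SA64."""
    if sa_width not in (4, 8):
        raise ValueError("sa_width must be 4 or 8")
    digest = hashlib.sha256(corpus).digest()
    return HEADER.pack(MAGIC, FORMAT_VERSION, sa_width, len(corpus), digest)


def check_header(header: bytes, corpus: bytes) -> int:
    """Validate a header against the corpus; return the SA entry width."""
    magic, version, sa_width, length, digest = HEADER.unpack(
        header[:HEADER.size]
    )
    if magic != MAGIC:
        raise ValueError("not a GLYPH index (bad magic)")
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported index format version {version}")
    if sa_width not in (4, 8):
        raise ValueError(f"unsupported SA entry width {sa_width}")
    if length != len(corpus) or digest != hashlib.sha256(corpus).digest():
        raise ValueError("stale index: corpus changed since the build")
    return sa_width
```

Recording the SA entry width in the header lets one query binary reject or dispatch SA32 and SA64 indexes explicitly. A multi-GB corpus would be hashed in chunks rather than held in memory as it is here.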

Summary

Limit	Value	Cause
SA32 corpus ceiling	~4 GiB	uint32 overflow
RAM practical ceiling	~12 GB corpus	118 GB RAM ÷ 9.4× ratio
Next unlock	SA64 path	format change, same algorithm
Long-term unlock	compressed SA	algorithmic change