GLYPH Scaling Limits
Current state (v0.x)
GLYPH v0.x uses SA32: the suffix array is stored as uint32 entries.
Hard format limit:
corpus ≤ 4,294,967,295 bytes (~4 GiB)
Above this limit, suffix positions no longer fit in uint32 and the index is silently corrupted. No overflow detection exists yet. This is a known gap.
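The missing guard can be sketched in a few lines. This is a minimal illustration, not GLYPH's builder code: the names `check_sa32_capacity`, `CorpusTooLargeError`, and `truncate_to_uint32` are hypothetical. The last function only models what an unchecked uint32 store does to a position that is too large.

```python
SA32_MAX_CORPUS_BYTES = 0xFFFF_FFFF  # 4,294,967,295, the format limit stated above


class CorpusTooLargeError(ValueError):
    """Hypothetical error type: corpus does not fit 32-bit suffix positions."""


def check_sa32_capacity(corpus_len: int) -> None:
    """Fail loudly at build time instead of letting positions wrap around."""
    if corpus_len > SA32_MAX_CORPUS_BYTES:
        raise CorpusTooLargeError(
            f"corpus is {corpus_len} bytes; SA32 supports at most "
            f"{SA32_MAX_CORPUS_BYTES} bytes"
        )


def truncate_to_uint32(position: int) -> int:
    """Model an unchecked uint32 store: the high bits are silently dropped."""
    return position & 0xFFFF_FFFF
```

Under this model, a suffix at position 2^32 + 5 would be recorded as position 5, which is why the failure is silent: the stored value is still a valid offset, just the wrong one.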
RAM economics
Plain FM-index without compression requires approximately:
| Component | Cost |
|---|---|
| SA32 | 4 bytes/symbol |
| BWT | 1 byte/symbol |
| FM index (checkpoints + Occ) | ~4 bytes/symbol |
| corpus | 1 byte/symbol |
| total | ~9-10× corpus size in RAM |
Observed on the benchmark machine (HDFS, 1 GB):
corpus: 1.0 GB
total RAM: ~9.4 GB
ratio: 9.4×
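The breakdown and the measurement can be cross-checked with a little arithmetic. The sketch below takes its per-symbol costs from the breakdown above and assumes one byte per symbol. The function names are hypothetical and not part of GLYPH.

```python
# Per-symbol costs in bytes, copied from the breakdown above.
BYTES_PER_SYMBOL = {"sa32": 4, "bwt": 1, "fm_index": 4, "corpus": 1}


def estimated_ram_bytes(corpus_bytes: int) -> int:
    """Upper estimate of resident memory for a plain, uncompressed index."""
    return corpus_bytes * sum(BYTES_PER_SYMBOL.values())


def observed_ratio(total_ram_bytes: float, corpus_bytes: float) -> float:
    """Ratio of measured RAM to corpus size."""
    return total_ram_bytes / corpus_bytes
```

The components sum to 10 bytes/symbol, so a 1 GB corpus is estimated at 10 GB. The measured 9.4 GB sits slightly below that, consistent with the FM index cost being approximate.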
Practical limits on current machine
Benchmark machine: AMD EPYC 4344P, 118 GB RAM available.
corpus ~4 GiB → SA32 format hard ceiling
corpus 5 GB → ~47 GB RAM, comfortable
corpus 10 GB → ~94 GB RAM, feasible
corpus 12 GB → ~113 GB RAM, near limit
RAM on this machine allows a corpus of up to ~12 GB, but the SA32 format stops at ~4 GiB. Below 12 GB, the binding constraint is therefore the SA32 format, not RAM.
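Which limit binds first can be worked out directly. This is a rough sketch that assumes the measured 9.4× ratio holds at larger sizes and that "118 GB" means decimal gigabytes. `max_corpus_bytes` is a hypothetical helper.

```python
SA32_LIMIT = 2**32 - 1   # format ceiling in bytes
RAM_RATIO = 9.4          # measured RAM / corpus ratio from the 1 GB benchmark


def max_corpus_bytes(ram_bytes: float, ratio: float = RAM_RATIO,
                     sa_limit: int = SA32_LIMIT) -> tuple[int, str]:
    """Return the largest indexable corpus and the constraint that caps it."""
    ram_bound = int(ram_bytes / ratio)
    if sa_limit < ram_bound:
        return sa_limit, "SA32 format"
    return ram_bound, "RAM"


limit, cause = max_corpus_bytes(118e9)
```

For the 118 GB benchmark machine the RAM bound works out to roughly 12.5 GB, well above the uint32 ceiling, so `cause` is `"SA32 format"`. On a 16 GB machine the RAM bound (about 1.7 GB) would bind first.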
Scaling ladder
Step 1 — SA32 stable path (current)
- works today
- corpus limit: ~4 GiB hard
- RAM limit: ~12 GB corpus on benchmark machine
- status: complete
Step 2 — SA64 path (next)
- suffix array stored as uint64
- removes 4 GiB hard ceiling
- unlocks corpus 4–12 GB on current machine
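- RAM ratio: SA entries widen from 4 to 8 bytes/symbol, so expect more than the ~9.4× measured for SA32 (roughly +4 bytes/symbol)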
- requires: builder changes, query binary changes, format versioning
- status: designed in SA64_DESIGN.md, not yet implemented
Step 3 — Segmented SA64
- multiple SA64 shards
- fan-out query across shards
- unlocks corpus beyond single machine RAM
- matches that span a shard boundary are still not detected
- status: planned
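The fan-out idea and its boundary blind spot can be shown on a toy corpus. This sketch uses plain `bytes.find` in place of a real per-shard index, and all function names are hypothetical. It only illustrates why independent shards miss matches that straddle a cut.

```python
from __future__ import annotations


def split_into_shards(corpus: bytes, shard_size: int) -> list[tuple[int, bytes]]:
    """Cut the corpus into (base_offset, shard_bytes) pieces with no overlap."""
    return [(off, corpus[off:off + shard_size])
            for off in range(0, len(corpus), shard_size)]


def find_all(haystack: bytes, pattern: bytes) -> list[int]:
    """All start offsets of pattern in haystack, overlapping matches included."""
    positions = []
    start = haystack.find(pattern)
    while start != -1:
        positions.append(start)
        start = haystack.find(pattern, start + 1)
    return positions


def fan_out_query(shards: list[tuple[int, bytes]], pattern: bytes) -> list[int]:
    """Query every shard independently and translate hits to global offsets."""
    hits = []
    for base, shard in shards:
        hits.extend(base + pos for pos in find_all(shard, pattern))
    return sorted(hits)


corpus = b"abcabcabc"
shards = split_into_shards(corpus, shard_size=4)   # b"abca", b"bcab", b"c"
whole = find_all(corpus, b"abc")                   # [0, 3, 6]
sharded = fan_out_query(shards, b"abc")            # [0]
```

The occurrences at offsets 3 and 6 each cross a shard cut, so no single shard contains them. Overlapping shards by the maximum pattern length minus one is one common remedy, though this document does not say whether GLYPH plans to use it.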
Step 4 — Compressed / sampled SA
- sampled suffix array (every k-th entry stored)
- wavelet tree for Occ table
- reduces RAM ratio from ~9.4× to ~2-3×
- unlocks corpus 50–100+ GB on reasonable hardware
- locate cost increases by up to O(k) LF steps per reported occurrence
- status: research
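The locate trade-off can be seen in a toy sampled suffix array. This is a teaching sketch under several assumptions: samples are taken at text positions that are multiples of k (one common variant), the LF mapping is precomputed as a plain list rather than derived from Occ or a wavelet tree, and the suffix array is built by naive sorting. None of the names come from GLYPH.

```python
from __future__ import annotations


def build_sampled_index(text: str, k: int) -> tuple[list[int], list[int], dict[int, int]]:
    """Return (full SA for checking, LF mapping, sampled rows -> text positions)."""
    assert "\0" not in text
    s = text + "\0"                       # sentinel sorts before every other symbol
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = [s[i - 1] for i in sa]          # i == 0 wraps to the sentinel

    # first_row[c] = number of symbols in s strictly smaller than c
    counts: dict[str, int] = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    # LF(row) = first_row[bwt[row]] + occurrences of bwt[row] before this row
    seen: dict[str, int] = {}
    lf = []
    for ch in bwt:
        lf.append(first_row[ch] + seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Keep only SA entries whose text position is a multiple of k.
    samples = {row: pos for row, pos in enumerate(sa) if pos % k == 0}
    return sa, lf, samples


def locate(row: int, lf: list[int], samples: dict[int, int]) -> tuple[int, int]:
    """Recover SA[row] by walking LF until a sampled row; returns (position, steps)."""
    steps = 0
    while row not in samples:
        row = lf[row]                     # one LF step moves one symbol left in the text
        steps += 1
    return samples[row] + steps, steps


sa, lf, samples = build_sampled_index("banana", k=3)
```

Each LF step moves one position to the left in the text, and position 0 is always sampled, so a lookup needs at most k - 1 steps. Memory for the suffix array drops by about a factor of k in exchange.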
Why SA64 before compressed SA
Compressed SA is more complex and changes the correctness model. SA64 is a format change with the same algorithm.
SA64 on current machine unlocks:
- corpus up to ~12 GB (RAM bound)
- multi-shard corpus beyond 4 GB per shard
Compressed SA becomes necessary only when:
- corpus exceeds available RAM / 9.4
- or RAM economics become the primary constraint
On a 118 GB machine, that threshold is ~12 GB corpus. Below that, SA64 is sufficient.
Known gaps before SA64
Before implementing SA64:
- SA32 overflow detection at build time (hard error, not silent)
- index format versioning (magic bytes + version field)
- corpus hash in index header (stale index detection)
These must exist in the SA32 path first. SA64 then inherits the same format discipline.
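One possible shape for such a header is sketched below. The layout, the magic bytes `GLYF`, the field widths, and the choice of SHA-256 are all assumptions made for illustration. The actual format is not specified in this document.

```python
import hashlib
import struct

MAGIC = b"GLYF"                      # hypothetical magic bytes
# Little-endian: magic (4s), format version (H), SA width in bits (H), SHA-256 (32s)
HEADER = struct.Struct("<4sHH32s")


def pack_header(version: int, sa_width_bits: int, corpus: bytes) -> bytes:
    """Build a fixed-size header that ties the index to one exact corpus."""
    digest = hashlib.sha256(corpus).digest()
    return HEADER.pack(MAGIC, version, sa_width_bits, digest)


def verify_header(header: bytes, corpus: bytes, supported_versions=(1,)) -> int:
    """Validate the header against the corpus; return the SA width in bits."""
    magic, version, sa_width_bits, digest = HEADER.unpack(header[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a GLYPH index (bad magic bytes)")
    if version not in supported_versions:
        raise ValueError(f"unsupported index format version {version}")
    if digest != hashlib.sha256(corpus).digest():
        raise ValueError("stale index: corpus hash does not match")
    return sa_width_bits
```

Recording the SA width in the header would let a query binary tell an SA32 index from an SA64 one before reading any entries. For multi-gigabyte corpora the hash would be computed in streaming fashion rather than over one in-memory buffer.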
Summary
| Limit | Value | Cause |
|---|---|---|
| SA32 corpus ceiling | ~4 GiB | uint32 overflow |
| RAM practical ceiling | ~12 GB corpus | 118 GB RAM ÷ 9.4× ratio |
| Next unlock | SA64 path | format change only |
| Long-term unlock | compressed SA | algorithmic change |