GLYPH Index Format v1
Status:
- experimental
- unstable during v0.x
Purpose:
Define the binary layout and invariants of GLYPH FM-index artifacts.
FM binary header#
Magic:
FMBINv1\0
Layout:
offset size field
--------------------------------
0 8 magic
8 8 corpus_bytes (uint64)
16 4 checkpoint_step (uint32)
20 8 num_blocks (uint64)
28 2048 C[256] uint64 table
... checkpoints
Checkpoint layout:
checkpoints[num_blocks][256]
Stored as:
uint32 per symbol count
Meaning:
Each checkpoint stores cumulative occurrence counts for all byte values before the corresponding block.
Artifact formats#
Current GLYPH v0.x artifacts:
| Artifact | Format | Versioned | Notes |
|---|---|---|---|
| fm.bin | FMBINv1\0 | yes | main FM index |
| fm_core.bin | FMV1 | yes | locate backend FM core |
| locate_core_s*.bin | LOC1 | yes | sampled locate structure |
| manifest.json | GLYPH_INDEX_MANIFEST_V1 | yes | integrity manifest |
| sa.bin | raw uint32 array | no | SA32 only |
| bwt.bin | raw uint8 stream | no | no header/version |
SA32 constraints#
Current SA format:
sa.bin = raw uint32 suffix array
Implications:
- hard corpus limit: 4,294,967,295 bytes
- no embedded version field
- no embedded corpus hash
- no embedded endian marker
SA64 cannot reuse this artifact format safely.
A future SA64 format requires:
- explicit magic bytes
- explicit version field
- explicit entry width
- compatibility policy
Status:
- SA32 raw format is temporary
- SA64 will introduce a new artifact format
BWT assumptions#
GLYPH v0.x assumes:
indexed_corpus = raw_corpus + appended 0x00 sentinel
Required invariant:
- raw corpus must not contain 0x00
- appended sentinel must be unique
Failure to satisfy this invariant may produce:
- shifted FM intervals
- incorrect occurrence counts
- deterministic undercounting
Query semantics#
GLYPH performs exact byte matching.
Returned result:
suffix-array interval [l, r)
Match count:
r - l
Segmented manifests#
Segmented retrieval uses a manifest describing shards.
Each shard contains:
- corpus
- suffix array
- BWT
- FM index
Global retrieval result is produced by:
- independent shard query
- deterministic count merge
Compatibility#
v0.x formats are not yet stable.
No backward compatibility guarantees currently exist.