GLYPH SA Container v1
Status:
- planned
- not yet implemented
- intended as migration path from raw SA32 to versioned SA artifacts
Purpose
Current sa.bin is a raw uint32 suffix array.
That format is temporary and unversioned.
SA Container v1 defines a future explicit artifact format for suffix arrays, so GLYPH can distinguish:
- SA32 vs SA64
- endian assumptions
- entry width
- corpus size
- artifact version
This is required before introducing SA64.
Why raw sa.bin cannot evolve safely
Current format:
sa.bin = raw uint32 array
Problems:
- no magic bytes
- no version field
- no entry width
- no corpus byte length
- no endian marker
- no artifact type marker
A raw uint32 SA file and a future raw uint64 SA file cannot be safely distinguished by readers without external metadata.
Therefore SA64 must not silently reuse raw sa.bin.
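The ambiguity can be shown with a few bytes. The following is a minimal sketch, not GLYPH code: it assumes little-endian entries and uses made-up suffix positions purely for illustration.

```python
import struct

# Sixteen bytes written as four little-endian uint32 suffix positions
# (illustrative values, not taken from a real corpus).
raw = struct.pack("<4I", 3, 2, 1, 0)

# A raw SA32 reader decodes four entries.
as_u32 = struct.unpack("<4I", raw)

# A raw SA64 reader accepts the very same bytes without any error,
# but decodes two unrelated values.
as_u64 = struct.unpack("<2Q", raw)

print(as_u32)
print(as_u64)
```

Nothing in the bytes themselves tells the reader which interpretation is intended, which is why the width has to be recorded in a header.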
Proposed container layout
Magic:
GLYPHSA1
Header:
offset  size  field
--------------------------------
0       8     magic = "GLYPHSA1"
8       4     version = 1 (uint32)
12      4     entry_width = 4 or 8 (uint32)
16      8     corpus_bytes (uint64)
24      8     sa_entries (uint64)
32      4     endian = 1 for little-endian (uint32)
36      4     reserved_flags (uint32)
40      ...   suffix array entries
Entry encoding:
- entry_width = 4 → uint32 entries
- entry_width = 8 → uint64 entries
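As a rough illustration of the layout above, the 40-byte header maps directly onto a fixed struct. This is a sketch under assumptions: the header integers are taken to be little-endian and unpadded (the document does not state the header's own byte order), and `pack_header` / `unpack_header` are hypothetical helper names, not existing GLYPH functions.

```python
import struct

# Assumed encoding: little-endian, no padding.
# 8s = magic, I = uint32, Q = uint64, in the order of the header table.
HEADER = struct.Struct("<8sIIQQII")
MAGIC = b"GLYPHSA1"

def pack_header(entry_width, corpus_bytes):
    """Build a header; sa_entries is set equal to corpus_bytes."""
    return HEADER.pack(MAGIC, 1, entry_width, corpus_bytes, corpus_bytes, 1, 0)

def unpack_header(buf):
    """Decode the first 40 bytes of buf into a dict of header fields."""
    names = ("magic", "version", "entry_width", "corpus_bytes",
             "sa_entries", "endian", "reserved_flags")
    return dict(zip(names, HEADER.unpack_from(buf, 0)))

header = pack_header(entry_width=4, corpus_bytes=1000)
fields = unpack_header(header)
print(HEADER.size, fields["entry_width"], fields["sa_entries"])
```

The struct size works out to exactly 40 bytes, matching the offset at which the suffix array entries begin.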
Compatibility policy
Current GLYPH v0.x keeps writing:
sa.bin
as raw uint32 for existing tools.
SA Container v1 should be introduced as a separate artifact:
sa_v1.bin
Existing tools remain unchanged until container-aware readers exist.
Migration path:
- keep raw sa.bin
- add sa_v1.bin writer
- add container-aware SA reader
- update build_bwt to accept either raw SA32 or SA container
- introduce SA64 as GLYPHSA1 with entry_width = 8
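The writer step could start as a thin wrapper around the existing raw output. The sketch below is an assumption-laden illustration, not the planned implementation: `wrap_raw_sa32` is a hypothetical name, the raw sa.bin bytes are assumed to be little-endian, and corpus_bytes is inferred from the entry count on the assumption that the raw array holds one entry per corpus byte.

```python
import struct

# Assumed header encoding: 40 bytes, little-endian, no padding.
HEADER = struct.Struct("<8sIIQQII")

def wrap_raw_sa32(raw):
    """Prefix raw uint32 suffix-array bytes with a GLYPHSA1 header."""
    if len(raw) == 0 or len(raw) % 4 != 0:
        raise ValueError("raw SA32 length must be a positive multiple of 4")
    entries = len(raw) // 4
    # Assumption: one SA entry per corpus byte, so corpus_bytes == entries.
    header = HEADER.pack(b"GLYPHSA1", 1, 4, entries, entries, 1, 0)
    return header + raw

# Usage: three illustrative entries become a 40 + 12 byte container.
raw = struct.pack("<3I", 2, 0, 1)
container = wrap_raw_sa32(raw)
print(len(container))
```

Because the entry bytes are copied unchanged, existing sa.bin consumers are unaffected and the container can be produced alongside the raw file.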
Safety requirements
SA container readers must validate:
- magic bytes
- version
- entry_width
- corpus_bytes > 0
- sa_entries == corpus_bytes
- file size matches header + entries
- all SA values are within [0, corpus_bytes)
Validation must fail fast: any failed check is a hard error.
A reader must never silently fall back from a bad container to raw mode.
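The checks listed above can be sketched as a single reader function. This is an illustrative sketch, not the GLYPH reader: `read_sa_container` and `SAContainerError` are hypothetical names, the header is assumed to be little-endian, and the endian-field check is an extra assumption beyond the list above.

```python
import struct

HEADER = struct.Struct("<8sIIQQII")  # assumed: little-endian, 40 bytes

class SAContainerError(ValueError):
    """Raised on any validation failure; callers must not fall back to raw mode."""

def read_sa_container(buf):
    if len(buf) < HEADER.size:
        raise SAContainerError("file shorter than header")
    magic, version, width, corpus_bytes, sa_entries, endian, _flags = \
        HEADER.unpack_from(buf, 0)
    if magic != b"GLYPHSA1":
        raise SAContainerError("bad magic")
    if version != 1:
        raise SAContainerError("unsupported version")
    if width not in (4, 8):
        raise SAContainerError("entry_width must be 4 or 8")
    if endian != 1:  # assumption: only little-endian is accepted
        raise SAContainerError("unsupported endian marker")
    if corpus_bytes == 0:
        raise SAContainerError("corpus_bytes must be > 0")
    if sa_entries != corpus_bytes:
        raise SAContainerError("sa_entries != corpus_bytes")
    if len(buf) != HEADER.size + sa_entries * width:
        raise SAContainerError("file size does not match header + entries")
    code = "I" if width == 4 else "Q"
    entries = struct.unpack_from(f"<{sa_entries}{code}", buf, HEADER.size)
    # Every suffix position must lie in [0, corpus_bytes).
    if any(v >= corpus_bytes for v in entries):
        raise SAContainerError("SA value out of range")
    return entries

good = HEADER.pack(b"GLYPHSA1", 1, 4, 3, 3, 1, 0) + struct.pack("<3I", 2, 1, 0)
print(read_sa_container(good))
```

Raising a dedicated exception type on every failure keeps the error path explicit, so a caller cannot mistake a rejected container for a raw file.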
Relationship to SA64
SA64 should be a format-compatible extension of the container:
GLYPHSA1 + entry_width = 8
This avoids creating separate incompatible artifact families.
SA64 is then a data-width change, not a new undocumented file type.
Status
This document defines the intended artifact contract.
Implementation is pending.