CORRECTNESS INVARIANTS

This document defines the current correctness contracts of GLYPH v0.x.

The goal is to formalize:

  • retrieval semantics
  • index assumptions
  • deterministic guarantees
  • sentinel behavior
  • component contracts

These invariants are treated as architectural constraints, not implementation details.


1. Corpus invariants#

1.1 Static corpus assumption#

GLYPH v0.x assumes:

  • immutable corpora
  • offline index construction
  • read-only query behavior

Indexes are not currently designed for dynamic mutation.


1.2 Raw byte semantics#

GLYPH operates over raw bytes.

No:

  • tokenization
  • normalization
  • encoding interpretation
  • semantic preprocessing

All retrieval semantics are byte-exact.


1.3 Sentinel exclusion invariant#

Input corpus MUST NOT contain:

0x00

Reason: GLYPH v0.x appends a real terminal sentinel byte during index construction.

Violation of this invariant invalidates retrieval semantics.

Future versions may replace this with:

  • 257-symbol alphabets
  • out-of-band sentinel representations
  • alternative suffix termination schemes

2. Sentinel invariants#

2.1 Real appended sentinel#

GLYPH v0.x indexes:

raw_corpus + appended_terminal_0x00

The sentinel is physically appended to indexed corpus data.

This is NOT a synthetic virtual sentinel.


2.2 Exactly one sentinel#

Indexed corpus MUST contain exactly one terminal sentinel.

Invariant:

count(0x00) == 1


2.3 Sentinel query behavior#

Querying:

b"\x00"

MUST return exactly one match.

This behavior is intentional and currently part of retrieval semantics.


3. Suffix array invariants#

3.1 SA correctness#

Suffix array ordering MUST preserve lexicographic suffix ordering over the sentinel-augmented corpus.


3.2 SA determinism#

Given identical corpus bytes:

  • suffix array construction MUST produce identical output
  • retrieval intervals MUST remain stable

3.3 SA ↔ corpus consistency#

All suffix array offsets MUST reference valid corpus positions.

No offset may exceed corpus length.


4. BWT invariants#

4.1 True BWT construction#

BWT MUST be constructed from:

  • the real suffix array
  • the real sentinel-appended corpus

Synthetic sentinel substitution is forbidden.


4.2 BWT length invariant#

Invariant:

len(BWT) == len(corpus_with_sentinel)


4.3 Sentinel consistency#

The sentinel position implied by:

  • corpus
  • suffix array
  • BWT
  • FM-index

MUST remain globally consistent.


5. FM-index invariants#

5.1 FM interval determinism#

Identical:

  • corpus
  • query
  • index files

MUST produce identical intervals.


5.2 Exact occurrence counting#

FM counts MUST match exact substring occurrence counts over:

  • overlapping matches
  • single-byte patterns
  • full-corpus matches

Python byte-search oracle currently acts as verification reference.


5.3 No probabilistic retrieval#

FM retrieval semantics are exact.

No:

  • approximation
  • fuzzy matching
  • ranking
  • heuristic expansion

6. Query invariants#

6.1 Exact byte matching#

Queries are matched byte-for-byte.

No:

  • Unicode normalization
  • stemming
  • lowercasing
  • tokenizer behavior

6.2 Overlapping occurrence semantics#

Overlapping matches MUST be counted.

Example:

corpus = b"aaaa" pattern = b"aa"

Expected count:

3

Offsets:

[0,1,2]


6.3 Deterministic query semantics#

Repeated identical queries over identical indexes MUST return identical results.


7. Retrieval semantics#

GLYPH currently guarantees only:

  • exact byte-level presence
  • deterministic retrieval behavior
  • stable retrieval semantics
  • exact occurrence counting

GLYPH does NOT guarantee:

  • semantic correctness
  • factual correctness
  • relevance
  • ranking quality
  • interpretation quality

8. Current verification mechanisms#

Current verification includes:

  • regression correctness tests
  • Python oracle comparison
  • sentinel-specific tests
  • deterministic repeated-query checks
  • HDFS correctness validation

9. Known limitations#

Current limitations include:

  • high RAM overhead
  • static corpus assumption
  • evolving APIs
  • incomplete locate-layer verification
  • incomplete large-scale fuzz coverage

GLYPH v0.x should currently be treated as experimental infrastructure research.


Core principle#

Correctness is treated as an architectural property.

Not:

  • a benchmark optimization
  • a side effect
  • a probabilistic approximation
  • a UI feature