SHARD BOUNDARY SEMANTICS

GLYPH v0.x currently supports segmented retrieval by splitting corpora into independently indexed shards.

This improves:

  • operational scalability
  • memory management
  • partial index reuse
  • distributed retrieval experiments

However, segmented retrieval introduces important semantic constraints.


Core invariant

Each shard is indexed independently.

FM retrieval operates only within a single shard; it never reads across a shard boundary.

GLYPH v0.x does NOT currently perform:

  • cross-shard stitching
  • overlap-aware reconstruction
  • boundary-spanning verification
  • multi-shard suffix continuation

Consequence

An occurrence that spans a shard boundary is not found, so segmented counts can be lower than the global-corpus count.

Example:

shard0 ends with:

blk_000

shard1 begins with:

123\n

Query:

blk_000123

Expected global-corpus count:

1

Current segmented result:

0

because the only occurrence crosses a shard boundary, so neither shard contains it in full.
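
The example above can be reproduced with a minimal sketch. This is not GLYPH code: plain Python string search stands in for the per-shard index, and the names count_occurrences, shard_local_count and global_count are hypothetical, chosen only for illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count all occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def shard_local_count(shards: list[str], pattern: str) -> int:
    """Search each shard on its own and sum the per-shard counts."""
    return sum(count_occurrences(shard, pattern) for shard in shards)


def global_count(shards: list[str], pattern: str) -> int:
    """Search the logical global corpus (all shards concatenated in order)."""
    return count_occurrences("".join(shards), pattern)


# shard0 ends with "blk_000", shard1 begins with "123\n" (illustrative data)
shards = ["job=7 blk_000", "123\njob=8 blk_000999\n"]

print(shard_local_count(shards, "blk_000123"))  # boundary-crossing query
print(global_count(shards, "blk_000123"))
```

Under these assumptions the shard-local count is 0 and the global count is 1. A query such as "blk_000", which never straddles the boundary, returns the same count either way.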


Current status

This behavior is currently:

  • known
  • expected
  • architectural

It is NOT currently treated as a bug.


Why this matters

Segmented retrieval correctness depends on whether retrieval semantics are defined as:

A: exact retrieval within independent shards

or:

B: exact retrieval over the logical global corpus

GLYPH v0.x currently implements A.

It does not yet implement B.


Possible future approaches

Future versions may support boundary-safe retrieval via:

  • overlap regions
  • shard stitching
  • rolling suffix carry-over
  • hierarchical verification layers
  • cross-shard continuation indexes

None are currently implemented.
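
As an illustration of the first item only, overlap regions could work roughly as sketched below. This is a sketch under stated assumptions, not a GLYPH design: it assumes a fixed upper bound on pattern length (max_pattern_len), uses plain string search in place of a per-shard index, and all names (build_overlapped_shards, find_all) are hypothetical. Each shard carries a copy of the next max_pattern_len - 1 characters of the corpus, and a match is reported only by the shard in whose owned region it starts, so no global offset is reported twice.

```python
def build_overlapped_shards(corpus: str, shard_size: int, max_pattern_len: int):
    """Split corpus into (base_offset, owned_len, text) shards.

    text is the owned region plus up to max_pattern_len - 1 following
    characters copied from the rest of the corpus (the overlap region).
    """
    if shard_size <= 0 or max_pattern_len <= 0:
        raise ValueError("shard_size and max_pattern_len must be positive")
    overlap = max_pattern_len - 1
    shards = []
    for base in range(0, len(corpus), shard_size):
        owned_len = min(shard_size, len(corpus) - base)
        text = corpus[base:base + owned_len + overlap]
        shards.append((base, owned_len, text))
    return shards


def find_all(shards, pattern: str, max_pattern_len: int) -> list[int]:
    """Return global start offsets of pattern, each reported exactly once."""
    if not pattern or len(pattern) > max_pattern_len:
        raise ValueError("pattern length must be in 1..max_pattern_len")
    offsets = []
    for base, owned_len, text in shards:
        pos = text.find(pattern)
        # Only matches that START in the owned region belong to this shard;
        # matches starting in the overlap are reported by a later shard.
        while pos != -1 and pos < owned_len:
            offsets.append(base + pos)
            pos = text.find(pattern, pos + 1)
    return offsets


corpus = "job=7 blk_000123\njob=8 blk_000999\n"
overlapped = build_overlapped_shards(corpus, shard_size=13, max_pattern_len=16)
print(find_all(overlapped, "blk_000123", max_pattern_len=16))
```

With shard_size=13 the first shard's owned region ends with "blk_000", as in the example earlier in this document, yet the boundary-crossing query is found at global offset 6. The cost is that every shard stores extra text, and patterns longer than the assumed bound are still not covered.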


Current recommendation

Segmented retrieval should currently be treated as:

exact retrieval within independently indexed shard regions

not as globally complete substring retrieval.


Testing implications

Future regression tests should explicitly include:

  • boundary-crossing patterns
  • overlap-region semantics
  • duplicate-offset handling
  • shard-local vs global retrieval expectations

Pinning these cases in tests guards against accidental semantic drift between shard-local and global expectations.
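
A regression table for the first and last items might look like the sketch below. It is illustrative only: plain string search stands in for GLYPH's per-shard retrieval, and Case, CASES and run_cases are hypothetical names. Each case records both the shard-local and the global expectation, so a change in either one shows up as a failure. Overlap-region and duplicate-offset cases are left out because they depend on features that are not implemented.

```python
from typing import NamedTuple


class Case(NamedTuple):
    shards: tuple          # shard contents, in corpus order
    pattern: str
    expected_local: int    # sum of independent per-shard counts
    expected_global: int   # count over the concatenated corpus


def count_occurrences(text: str, pattern: str) -> int:
    """Count all occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


CASES = [
    # Boundary-crossing pattern: missed shard-locally, present globally.
    Case(("aaa blk_000", "123\nbbb"), "blk_000123", 0, 1),
    # Pattern fully inside one shard: both semantics agree.
    Case(("aaa blk_000", "123\nbbb"), "blk_000", 1, 1),
    # Overlapping occurrences: shard-local undercounts without reaching zero.
    Case(("abab", "ab"), "abab", 1, 2),
]


def run_cases(cases):
    """Return (case, local, global) for every case whose counts differ."""
    failures = []
    for case in cases:
        local = sum(count_occurrences(s, case.pattern) for s in case.shards)
        global_ = count_occurrences("".join(case.shards), case.pattern)
        if (local, global_) != (case.expected_local, case.expected_global):
            failures.append((case, local, global_))
    return failures


print(run_cases(CASES))
```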


Core principle

Segmented retrieval correctness must be defined explicitly.

Silent incompleteness is more dangerous than explicit constraints.