WHY DETERMINISTIC RETRIEVAL

Modern retrieval systems increasingly optimize for probabilistic relevance.

This is often useful:

  • semantic search
  • embeddings
  • reranking
  • contextual retrieval
  • approximate nearest-neighbor systems

However, these systems also introduce uncertainty.

The same query may:

  • return different results over time
  • depend on ranking heuristics
  • depend on model updates
  • depend on embedding drift
  • lose exact byte provenance

GLYPH explores the opposite direction.


Core idea#

GLYPH treats retrieval as an exact infrastructure problem.

Goal:

same bytes in → same matches out

The system operates over:

  • static corpora
  • exact byte substrings
  • deterministic index structures

No semantic interpretation is required.


Why exactness matters#

Exact retrieval becomes important when systems need:

  • reproducibility
  • auditability
  • exact provenance
  • stable byte offsets
  • deterministic verification
  • low-level observability

Examples include:

  • infrastructure logs
  • binary corpora
  • forensic analysis
  • retrieval validation
  • exact post-filtering beneath probabilistic systems

Deterministic vs probabilistic retrieval#

Probabilistic systems are often optimized for:

  • usefulness
  • semantic flexibility
  • approximate intent matching

Deterministic systems optimize for:

  • exact presence
  • reproducibility
  • stable retrieval semantics
  • infrastructure predictability

These goals are different.

GLYPH does not attempt to replace probabilistic systems.

Instead, it explores whether exact deterministic retrieval can serve as a stable verification substrate beneath them.


Exact verification layer#

One possible future architecture:

LLM ↓ semantic retrieval ↓ reranker ↓ GLYPH exact verifier ↓ exact byte offsets ↓ ground-truth confirmation

In this model:

  • probabilistic systems generate candidates
  • deterministic systems verify exact presence

Current limitations#

GLYPH is currently experimental.

Known limitations include:

  • high RAM overhead
  • static-corpus assumptions
  • evolving APIs
  • incomplete correctness coverage
  • limited operational hardening

The project is infrastructure research, not a production platform.


Research direction#

GLYPH currently explores:

  • FM-index infrastructure
  • suffix-array retrieval
  • deterministic substring search
  • mmap-based retrieval behavior
  • exact byte-offset reproducibility
  • retrieval observability
  • static-corpus verification semantics

Core principle:

exactness is a capability,
not a byproduct.