GLYPH Benchmark Methodology

What is measured#

GLYPH currently exposes two distinct latency layers.

These layers measure different operational realities and must not be compared directly.

Layer 1 — End-to-end verified query#

Tool:

benchmarks/cold_warm_v1.py

Measures the complete verified operational path:

Python startup
+ manifest integrity verification
+ verified query wrapper
+ query_fm_v1 subprocess launch
+ FM query execution
+ result parsing

This benchmark measures what a real CLI user experiences when using the verified query path.

Current mini corpus result (56-byte corpus, 2 occurrences of "error"):

cold:      ~19.2 ms
warm p50:  ~19.8 ms
warm p95:  ~20.2 ms
warm p99:  ~20.3 ms

Important:

The dominant cost here is process startup and verification overhead, not FM computation itself.

At mini scale, the FM backward-search portion is effectively negligible relative to Python/subprocess startup cost.

Layer 2 — Persistent FM backend query#

Tool:

benchmarks/persistent_fm_v1.py

Measures persistent in-memory FM querying:

mmap-loaded FM index
+ persistent C++ backend
+ backward search
+ count return

This benchmark excludes:

per-query Python startup
per-query subprocess startup
manifest verification overhead

The backend process is started once and reused for all warm queries.

Current mini benchmark result:

startup:   ~1.0 ms
cold:      ~0.025 ms
warm p50:  ~0.007 ms
warm p95:  ~0.009 ms
warm p99:  ~0.010 ms

Example response:

20 22 2

Interpretation:

The persistent backend measures actual FM query latency once the index is already resident in memory.

This isolates FM search cost from operational wrapper overhead.

What is NOT measured#

The current benchmark suite does not yet measure:

cold mmap page-fault behavior after reboot
persistent backend latency under memory pressure
concurrent query contention
network/HTTP overhead
index build time
cross-machine reproducibility
persistent backend p99 on large corpora
shard fan-out overhead for segmented retrieval

Hardware disclaimer#

All benchmark results are machine-local measurements.

Numbers are not portable across machines.

Reproducible benchmark methodology requires documenting:

CPU model
RAM size
storage type
OS/kernel version
Python version
warm vs cold page cache state

Current benchmark machine specification is not yet committed.

This is a known documentation gap.

Why cold/warm separation matters#

Cold and warm queries measure different system behavior.

Warm query: FM algorithm cost with data already resident in memory.

Cold query: process startup + mmap initialization + page loading + cache population

Reporting only warm numbers hides first-query operational cost.

GLYPH benchmarks intentionally separate these layers.

Why p50/p95/p99 matter#

Average latency alone is insufficient.

Tail latency exposes:

scheduler jitter
page-cache misses
process startup variance
GC/runtime noise
storage stalls

Interpretation guideline:

p99 >> p50
    unstable latency envelope

p99 ≈ p50
    predictable behavior

Current persistent backend behavior:

p50 ≈ 0.007 ms
p99 ≈ 0.010 ms

This indicates stable warm-query behavior at mini scale.

Known gaps#

persistent backend benchmark on HDFS 1GB
fixed reproducible query set committed to repo
cold-start measurements after cache drop/reboot
documented benchmark hardware spec
segmented retrieval benchmark methodology
shard fan-out p95/p99
HTTP server overhead benchmark
concurrent query benchmark

Benchmark files#

File	Purpose
`benchmarks/cold_warm_v1.py`	End-to-end verified query benchmark
`benchmarks/persistent_fm_v1.py`	Persistent in-memory FM latency benchmark
`benchmarks/bench_1gb_persistent.py`	Legacy persistent 1GB benchmark
`benchmarks/bench_hdfs_1gb.sh`	Legacy HDFS 1GB benchmark pipeline
`benchmarks/HDFS_1GB_BENCHMARK.md`	Historical 1GB benchmark notes

Interpretation#

GLYPH is not designed as a replacement for one-off grep scans.

The architecture targets deterministic repeated exact retrieval over prepared static corpora.

The two latency layers serve different operational models:

Persistent backend (~0.007 ms warm):

long-lived resident service
repeated exact queries
mmap-resident indexes
low-latency retrieval systems

Verified wrapper (~19 ms):

integrity-first workflows
CLI tooling
fail-fast artifact verification
operational correctness boundaries

These are different engineering tradeoffs and should not be compared as equivalent latency paths.