# GLYPH Benchmark Methodology
## What is measured
GLYPH currently exposes two distinct latency layers.
These layers measure different operational realities and must not be compared directly.
### Layer 1 — End-to-end verified query

Tool: `benchmarks/cold_warm_v1.py`

Measures the complete verified operational path, i.e. the sum of:

- Python startup
- manifest integrity verification
- verified query wrapper
- `query_fm_v1` subprocess launch
- FM query execution
- result parsing

This benchmark measures what a real CLI user experiences when using the verified query path.
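
The manifest integrity step can be pictured as a fail-fast hash check that runs before any query is launched. The sketch below illustrates the idea under stated assumptions and is not GLYPH's verifier: it assumes a hypothetical JSON manifest of the form `{"files": {"<relative path>": "<sha256 hex>"}}`, and `verify_manifest` and `sha256_file` are illustrative names.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large index artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> None:
    """Fail fast if any artifact listed in the (hypothetical) manifest is missing or altered.

    Assumed layout: {"files": {"relative/name": "<sha256 hex>", ...}},
    with paths resolved relative to the manifest's own directory.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    base = manifest_path.parent
    for rel_name, expected in manifest["files"].items():
        artifact = base / rel_name
        if not artifact.is_file():
            raise FileNotFoundError(f"missing artifact: {rel_name}")
        actual = sha256_file(artifact)
        if actual != expected:
            raise ValueError(f"hash mismatch for {rel_name}: {actual} != {expected}")
```

A wrapper in this style would call `verify_manifest` before spawning the query subprocess, so a missing or altered artifact aborts the run instead of producing a result from an unverified index.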
Current mini corpus result (56-byte corpus, 2 occurrences of "error"):
- cold: ~19.2 ms
- warm p50: ~19.8 ms
- warm p95: ~20.2 ms
- warm p99: ~20.3 ms

Important:
The dominant cost here is process startup and verification overhead, not FM computation itself.
At mini scale, the FM backward-search portion is effectively negligible relative to Python/subprocess startup cost.
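
A Layer 1 style loop, one fresh process per query with the first run reported separately, could look like the following sketch. It is not `cold_warm_v1.py`: the command is a placeholder (`sys.executable -c "pass"`, i.e. bare interpreter startup) standing in for the verified query CLI, whose invocation is not given here, and the percentiles come from `statistics.quantiles` with linear interpolation.

```python
import statistics
import subprocess
import sys
import time

# Placeholder command: bare interpreter startup stands in for the real
# verified-query CLI, whose arguments are not documented here.
QUERY_CMD = [sys.executable, "-c", "pass"]


def time_one_run(cmd: list[str]) -> float:
    """Wall-clock milliseconds for one complete process launch, run, and exit."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return (time.perf_counter() - start) * 1000.0


def cold_warm(cmd: list[str], warm_runs: int = 50) -> dict[str, float]:
    """Report the first run as 'cold' and summarize later runs as warm percentiles.

    The first run is only cold relative to this loop; a true cold start also
    requires dropping the OS page cache, which this sketch does not do.
    """
    cold_ms = time_one_run(cmd)
    warm = [time_one_run(cmd) for _ in range(warm_runs)]
    cuts = statistics.quantiles(warm, n=100, method="inclusive")  # 99 cut points
    return {"cold": cold_ms, "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    for name, value in cold_warm(QUERY_CMD).items():
        print(f"{name}: {value:.2f} ms")
```

Because every iteration pays a full process launch, numbers from a loop like this are dominated by startup cost even when the query itself is trivial, which is the effect described above for the mini corpus.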
### Layer 2 — Persistent FM backend query

Tool: `benchmarks/persistent_fm_v1.py`

Measures persistent in-memory FM querying:

- mmap-loaded FM index
- persistent C++ backend
- backward search
- count return

This benchmark excludes:

- per-query Python startup
- per-query subprocess startup
- manifest verification overhead

The backend process is started once and reused for all warm queries.
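
The start-once, query-many pattern can be sketched with a long-lived child process that answers one query per line over stdin/stdout. This is an illustration only: the child is a stand-in that counts substrings with `str.count` (non-overlapping, unlike an FM index) instead of running an FM backend, and the `ready` handshake and line protocol are hypothetical, not the interface `persistent_fm_v1.py` actually uses. The point is where the timers start and stop: startup is measured once, and each query is timed as a single request/response round trip.

```python
import subprocess
import sys
import time

# Hypothetical stand-in backend: it "loads" a tiny corpus once, prints a ready
# line, then answers one query per input line with an occurrence count.
# A real FM backend would mmap its index and run backward search instead.
CHILD_SOURCE = r'''
import sys
corpus = "error at boot; retry; error again"
sys.stdout.write("ready\n")
sys.stdout.flush()
while True:
    line = sys.stdin.readline()
    if not line:  # EOF: the parent closed stdin
        break
    sys.stdout.write(str(corpus.count(line.rstrip("\n"))) + "\n")
    sys.stdout.flush()
'''


def start_backend() -> tuple[subprocess.Popen, float]:
    """Launch the backend once and return it with its startup time in ms."""
    start = time.perf_counter()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Startup ends when the child reports ready, not when Popen returns.
    if proc.stdout.readline().strip() != "ready":
        raise RuntimeError("backend did not start")
    return proc, (time.perf_counter() - start) * 1000.0


def query(proc: subprocess.Popen, pattern: str) -> tuple[int, float]:
    """Send one query over the already-open pipe; return (count, round-trip ms)."""
    start = time.perf_counter()
    proc.stdin.write(pattern + "\n")
    proc.stdin.flush()
    reply = proc.stdout.readline()
    return int(reply), (time.perf_counter() - start) * 1000.0


def stop_backend(proc: subprocess.Popen) -> None:
    """Close stdin so the child sees EOF, wait for it to exit, then close stdout."""
    proc.stdin.close()
    proc.wait(timeout=5)
    proc.stdout.close()


if __name__ == "__main__":
    proc, startup_ms = start_backend()
    try:
        warm = sorted(query(proc, "error")[1] for _ in range(1000))
        print(f"startup: {startup_ms:.2f} ms, median round trip: {warm[len(warm) // 2]:.4f} ms")
    finally:
        stop_backend(proc)
```

Keeping the process alive means each timed query covers only the pipe round trip and the lookup; with a real resident index the lookup would be the FM search, while here it is the stand-in `str.count`.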
Current mini benchmark result:

- startup: ~1.0 ms
- cold: ~0.025 ms
- warm p50: ~0.007 ms
- warm p95: ~0.009 ms
- warm p99: ~0.010 ms

Example response: `20 22 2`

Interpretation:
The persistent backend measures actual FM query latency once the index is already resident in memory.
This isolates FM search cost from operational wrapper overhead.
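
For reference, the backward search that this layer isolates is the textbook FM-index procedure: it narrows a half-open suffix-array interval `[sp, ep)` one pattern character at a time, from the last character to the first, and the match count is `ep - sp`. The pure-Python sketch below illustrates that technique only; it is not GLYPH's C++ backend, and its naive suffix-array construction and linear-scan `occ` are far too slow for real corpora. Returning `(sp, ep, count)` resembles the three-number example response above, but whether the backend's output uses this exact interval convention is an assumption.

```python
def build_fm(text: str):
    """Build a toy FM index over text and return (bwt, c_table, occ).

    A NUL character is appended as a unique sentinel that sorts before every
    other character. The suffix array is built naively for clarity only.
    """
    s = text + "\0"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wrapping to the sentinel at i == 0).
    bwt = "".join(s[i - 1] for i in sa)

    # C[ch] = number of characters in s that are strictly smaller than ch.
    counts: dict[str, int] = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    c_table: dict[str, int] = {}
    running = 0
    for ch in sorted(counts):
        c_table[ch] = running
        running += counts[ch]

    def occ(ch: str, i: int) -> int:
        """Occurrences of ch in bwt[:i]; real indexes answer this with rank structures."""
        return bwt.count(ch, 0, i)

    return bwt, c_table, occ


def backward_search(pattern: str, c_table: dict[str, int], occ, n: int):
    """Return (sp, ep, count) for the half-open suffix-array interval [sp, ep)."""
    sp, ep = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:  # character never occurs in the text
            return 0, 0, 0
        sp = c_table[ch] + occ(ch, sp)
        ep = c_table[ch] + occ(ch, ep)
        if sp >= ep:  # interval became empty: no match
            return sp, ep, 0
    return sp, ep, ep - sp


if __name__ == "__main__":
    bwt, c_table, occ = build_fm("error at boot; retry; error again")
    sp, ep, count = backward_search("error", c_table, occ, len(bwt))
    print(sp, ep, count)  # count is 2: "error" occurs twice in the sample text
```

Each pattern character costs two `occ` lookups, so with constant-time rank structures the search is proportional to the pattern length rather than the corpus size, which is why the warm figures above are so small once the index is resident.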
## What is NOT measured
The current benchmark suite does not yet measure:
- cold mmap page-fault behavior after reboot
- persistent backend latency under memory pressure
- concurrent query contention
- network/HTTP overhead
- index build time
- cross-machine reproducibility
- persistent backend p99 on large corpora
- shard fan-out overhead for segmented retrieval
## Hardware disclaimer
All benchmark results are machine-local measurements.
Numbers are not portable across machines.
Reproducible benchmark methodology requires documenting:
- CPU model
- RAM size
- storage type
- OS/kernel version
- Python version
- warm vs cold page cache state
Current benchmark machine specification is not yet committed.
This is a known documentation gap.
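
One way to start closing this gap is a small helper that records the fields listed above next to each result. The sketch below uses only the Python standard library and is hypothetical in its function names and output layout; `/proc/cpuinfo` and `os.sysconf` are Linux/POSIX-specific and fall back gracefully, and storage type and page-cache state cannot be detected reliably this way, so they are left as fields to fill in by hand.

```python
import os
import platform
from typing import Optional


def cpu_model() -> str:
    """Best-effort CPU model: /proc/cpuinfo on Linux, else platform.processor()."""
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or platform.machine()


def ram_bytes() -> Optional[int]:
    """Physical RAM via POSIX sysconf; None where unsupported (e.g. Windows)."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


def machine_spec(storage_type: str = "UNKNOWN", page_cache: str = "UNKNOWN") -> dict:
    """Collect the fields a benchmark report should carry alongside its numbers."""
    return {
        "cpu_model": cpu_model(),
        "cpu_count": os.cpu_count(),
        "ram_bytes": ram_bytes(),
        "storage_type": storage_type,  # not auto-detected; record manually
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "page_cache_state": page_cache,  # e.g. "warm" or "dropped"; record manually
    }
```

For example, `machine_spec(storage_type="NVMe SSD", page_cache="warm")` could be serialized with `json.dumps` and committed next to the result files; those argument values are placeholders.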
## Why cold/warm separation matters
Cold and warm queries measure different system behavior.
- Warm query: FM algorithm cost with data already resident in memory.
- Cold query: process startup + mmap initialization + page loading + cache population.

Reporting only warm numbers hides first-query operational cost, so GLYPH benchmarks intentionally report cold and warm results separately.
## Why p50/p95/p99 matter
Average latency alone is insufficient.
Tail latency exposes:
- scheduler jitter
- page-cache misses
- process startup variance
- GC/runtime noise
- storage stalls
Interpretation guideline:

- p99 >> p50: unstable latency envelope
- p99 ≈ p50: predictable behavior

Current persistent backend behavior:

- p50 ≈ 0.007 ms
- p99 ≈ 0.010 ms

This indicates stable warm-query behavior at mini scale.
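
The guideline above can be made mechanical. The sketch below computes nearest-rank percentiles from raw samples and reports the p99/p50 ratio. Nearest-rank is one common percentile definition among several (interpolating definitions give slightly different values on small samples), and the 2.0 cutoff for an unstable envelope is an arbitrary placeholder, not a GLYPH threshold.

```python
import math


def nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_report(samples: list[float], unstable_ratio: float = 2.0) -> dict:
    """p50/p95/p99 plus the p99/p50 ratio; unstable_ratio is a placeholder cutoff."""
    p50, p95, p99 = (nearest_rank(samples, p) for p in (50, 95, 99))
    ratio = p99 / p50 if p50 > 0 else math.inf
    return {"p50": p50, "p95": p95, "p99": p99, "p99/p50": ratio, "stable": ratio < unstable_ratio}
```

Applied to the persistent backend figures above, p99/p50 is 0.010 / 0.007 ≈ 1.4, which falls under the placeholder cutoff.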
## Known gaps
- persistent backend benchmark on HDFS 1GB
- fixed reproducible query set committed to repo
- cold-start measurements after cache drop/reboot
- documented benchmark hardware spec
- segmented retrieval benchmark methodology
- shard fan-out p95/p99
- HTTP server overhead benchmark
- concurrent query benchmark
## Benchmark files

| File | Purpose |
|---|---|
| `benchmarks/cold_warm_v1.py` | End-to-end verified query benchmark |
| `benchmarks/persistent_fm_v1.py` | Persistent in-memory FM latency benchmark |
| `benchmarks/bench_1gb_persistent.py` | Legacy persistent 1GB benchmark |
| `benchmarks/bench_hdfs_1gb.sh` | Legacy HDFS 1GB benchmark pipeline |
| `benchmarks/HDFS_1GB_BENCHMARK.md` | Historical 1GB benchmark notes |

## Interpretation
GLYPH is not designed as a replacement for one-off grep scans.
The architecture targets deterministic repeated exact retrieval over prepared static corpora.
The two latency layers serve different operational models:
- Persistent backend (~0.007 ms warm):
  - long-lived resident service
  - repeated exact queries
  - mmap-resident indexes
  - low-latency retrieval systems
- Verified wrapper (~19 ms):
  - integrity-first workflows
  - CLI tooling
  - fail-fast artifact verification
  - operational correctness boundaries

These are different engineering tradeoffs and should not be compared as equivalent latency paths.