HDFS 1GB Benchmark — GLYPH v0.1

Dataset:

  • HDFS.log truncated to 1GB
  • source: Loghub / HDFS dataset

Query set:

  • 100 unique blk_* IDs extracted from the 1GB corpus

Baseline:

time while read -r q; do
  grep -F "$q" bench_1gb/HDFS_1GB.log > /dev/null
done < bench_1gb/queries.txt

Result:

  • grep -F repeated scan: 11.516 sec

GLYPH index:

  • SA32u built
  • BWT built
  • FM-index built
  • persistent backend: query_fm_server_v1

Index artifacts:

  • corpus: 1.0GB
  • SA32u: 4.0GB
  • BWT: 1.0GB
  • FM: 8.1GB

Persistent FM benchmark:

  • queries: 100
  • total_time_sec: 0.001673
  • avg_ms_per_query: 0.016727

Interpretation:

GLYPH is not a replacement for grep on one-off ad-hoc scans.

GLYPH is designed for repeated exact queries over pre-indexed static corpora. In that mode, it avoids rescanning the corpus and uses a persistent FM-index backend.