Segmented Fixed Correctness — GLYPH v0.2
Branch:
- feature/segmented-v0.2
Purpose:
Validate segmented retrieval after fixing FM sentinel semantics.
Root cause fixed:
- FM-index requires corpus + real appended 0x00 sentinel.
- Previous builds used synthetic BWT sentinel without appending a real sentinel to the indexed corpus.
- This caused shifted FM intervals and undercounting.
Fixed build flow:
raw corpus
-> tools/prepare_sentinel_corpus_v1.py
-> build_sa_u32
-> build_bwt
-> build_fm
Dataset:
- HDFS 1GB
- split into two independent 512MB shards
- each shard indexed through tools/build_glyph_index_v1.sh
Manifest:
- manifests/segmented_1gb_fixed.json
Pattern:
- blk_-1000095285706020638
Ground truth:
- shard0 Python bytes.count: 17
- shard1 Python bytes.count: 10
- total Python bytes.count: 27
GLYPH segmented result:
- shard0 FM count: 17
- shard1 FM count: 10
- merged total_count: 27
Conclusion:
Segmented retrieval correctness is restored when shards are built through the sentinel-safe indexing pipeline.