BUG — FM count undercounts HDFS block-id pattern
Branch:
- feature/segmented-v0.2
Status:
- BLOCKER for segmented correctness validation
Dataset:
- sanity_hdfs/corpus.bin
- source: HDFS_1GB.log copied from bench_1gb/HDFS_1GB.log
- size: 1GB
Pattern:
- blk_-1000095285706020638
Ground truth:
- Python bytes.count: 27
FM result after clean rebuild:
- FM interval: [715386539, 715386552)
- FM count: 13
Commands:
rm -rf sanity_hdfs
mkdir -p sanity_hdfs
cp bench_1gb/HDFS_1GB.log sanity_hdfs/corpus.bin
./build/build_sa_u32 sanity_hdfs/corpus.bin sanity_hdfs/sa.bin
./build/build_bwt sanity_hdfs/corpus.bin sanity_hdfs/sa.bin sanity_hdfs/bwt.bin 0
./build/build_fm sanity_hdfs/bwt.bin sanity_hdfs/fm.bin 128
python3 - << 'PY'
from pathlib import Path q = b"blk_-1000095285706020638" data = Path("sanity_hdfs/corpus.bin").read_bytes() print(data.count(q)) PY
./build/query_fm_v1 sanity_hdfs/fm.bin sanity_hdfs/bwt.bin 626c6b5f2d31303030303935323835373036303230363338
Control:
- synthetic sanity_fm corpus returns correct count:
- Python bytes.count: 1000
- FM count: 1000
Conclusion:
The FM path works on small synthetic data but undercounts this real HDFS pattern on a 1GB corpus.
Next investigation:
- validate BWT output against SA for sampled positions
- verify FM rank/checkpoint logic on large BWT
- test same pattern on smaller prefixes of HDFS corpus
- bisect corpus size threshold
Size bisect#
Pattern:
- blk_-1000095285706020638
Results:
| Corpus prefix | Python bytes.count | FM count |
|---|---|---|
| 64MB | 0 | 0 |
| 128MB | 0 | 0 |
| 256MB | 0 | 0 |
| 512MB | 17 | 3 |
| 1GB | 27 | 13 |
Conclusion:
The bug appears once the target pattern exists in the HDFS corpus. This is not caused by segmented routing. The issue is in the FM/BWT/query path or in how this real corpus is indexed.
Size bisect#
Pattern:
- blk_-1000095285706020638
Results:
| Corpus prefix | Python bytes.count | FM count |
|---|---|---|
| 64MB | 0 | 0 |
| 128MB | 0 | 0 |
| 256MB | 0 | 0 |
| 512MB | 17 | 3 |
| 1GB | 27 | 13 |
Conclusion:
The bug appears once the target pattern exists in the HDFS corpus. This is not caused by segmented routing. The issue is in the FM/BWT/query path or in how this real corpus is indexed.
Localization update#
512MB HDFS prefix:
- Python bytes.count: 17
- SA scan count: 17
- BWT validation for matching SA rows: 17 checked / 0 bad
- FM count with checkpoint_step=128: 3
- FM count with checkpoint_step=64: 3
Conclusion:
SA and BWT are correct for the target pattern. The undercount is not caused by checkpoint step size.
Likely faulty layer:
- FM rank / occ calculation
- C-table usage
- backward-search implementation in query_fm_v1 / query_fm_server_v1
SA interval truth#
512MB HDFS prefix:
Pattern:
- blk_-1000095285706020638
SA scan result:
- SA match count: 17
- correct SA interval: [359783091, 359783108)
- matches are contiguous: true
FM result:
- FM interval: [359783106, 359783109)
- FM count: 3
Interpretation:
The FM query path returns an interval shifted inside/near the correct suffix-array interval. SA and BWT are correct. The fault is in FM backward-search support data or occ/C logic.
Root cause confirmed#
The FM undercount is caused by missing real appended sentinel in the indexed corpus.
Previous behavior:
- corpus did not contain 0x00
- build_bwt emitted synthetic sentinel byte for SA=0
- FM C-table was built from BWT containing this synthetic sentinel
- corpus alphabet and BWT/FM alphabet were inconsistent
- backward search intervals shifted and undercounted real matches
Sentinel fix test:
- input: 512MB HDFS prefix + appended 0x00
- Python bytes.count: 17
- FM count: 17
- FM interval: [359783092, 359783109)
Conclusion:
FM correctness requires indexing corpus + real unique sentinel.
For v0.x, safe mode is:
- require corpus does not contain 0x00
- append 0x00 before SA/BWT/FM build
- exclude sentinel from user query surface
Long-term raw-byte solution:
- implement 257-symbol alphabet or explicit sentinel handling independent of byte values.
Status update#
Previous status:
- BLOCKER for segmented correctness validation
Current status:
- ROOT CAUSE FOUND
- FIX PATH IMPLEMENTED
Implemented safeguards:
- tools/prepare_sentinel_corpus_v1.py
- tools/build_glyph_index_v1.sh
Canonical invariant:
GLYPH FM-index v0.x must index:
corpus + real appended 0x00 sentinel
The following flow is now required:
raw corpus
-> prepare_sentinel_corpus_v1.py
-> build_sa_u32
-> build_bwt
-> build_fm
Validation result:
- problematic HDFS pattern now returns exact FM counts
- segmented retrieval correctness investigation may continue
Remaining long-term limitation:
v0.x still assumes:
- corpus must not contain 0x00
Future robust solution:
- 257-symbol alphabet or explicit out-of-band sentinel representation