BUG — FM count undercounts HDFS block-id pattern

Branch:

  • feature/segmented-v0.2

Status:

  • BLOCKER for segmented correctness validation

Dataset:

  • sanity_hdfs/corpus.bin
  • source: HDFS_1GB.log copied from bench_1gb/HDFS_1GB.log
  • size: 1GB

Pattern:

  • blk_-1000095285706020638

Ground truth:

  • Python bytes.count: 27

FM result after clean rebuild:

  • FM interval: [715386539, 715386552)
  • FM count: 13

Commands:

rm -rf sanity_hdfs
mkdir -p sanity_hdfs
cp bench_1gb/HDFS_1GB.log sanity_hdfs/corpus.bin

./build/build_sa_u32 sanity_hdfs/corpus.bin sanity_hdfs/sa.bin
./build/build_bwt sanity_hdfs/corpus.bin sanity_hdfs/sa.bin sanity_hdfs/bwt.bin 0
./build/build_fm sanity_hdfs/bwt.bin sanity_hdfs/fm.bin 128

python3 - << 'PY'

from pathlib import Path q = b"blk_-1000095285706020638" data = Path("sanity_hdfs/corpus.bin").read_bytes() print(data.count(q)) PY

./build/query_fm_v1 sanity_hdfs/fm.bin sanity_hdfs/bwt.bin 626c6b5f2d31303030303935323835373036303230363338

Control:

  • synthetic sanity_fm corpus returns correct count:
    • Python bytes.count: 1000
    • FM count: 1000

Conclusion:

The FM path works on small synthetic data but undercounts this real HDFS pattern on a 1GB corpus.

Next investigation:

  • validate BWT output against SA for sampled positions
  • verify FM rank/checkpoint logic on large BWT
  • test same pattern on smaller prefixes of HDFS corpus
  • bisect corpus size threshold

Size bisect#

Pattern:

  • blk_-1000095285706020638

Results:

Corpus prefix Python bytes.count FM count
64MB 0 0
128MB 0 0
256MB 0 0
512MB 17 3
1GB 27 13

Conclusion:

The bug appears once the target pattern exists in the HDFS corpus. This is not caused by segmented routing. The issue is in the FM/BWT/query path or in how this real corpus is indexed.

Size bisect#

Pattern:

  • blk_-1000095285706020638

Results:

Corpus prefix Python bytes.count FM count
64MB 0 0
128MB 0 0
256MB 0 0
512MB 17 3
1GB 27 13

Conclusion:

The bug appears once the target pattern exists in the HDFS corpus. This is not caused by segmented routing. The issue is in the FM/BWT/query path or in how this real corpus is indexed.

Localization update#

512MB HDFS prefix:

  • Python bytes.count: 17
  • SA scan count: 17
  • BWT validation for matching SA rows: 17 checked / 0 bad
  • FM count with checkpoint_step=128: 3
  • FM count with checkpoint_step=64: 3

Conclusion:

SA and BWT are correct for the target pattern. The undercount is not caused by checkpoint step size.

Likely faulty layer:

  • FM rank / occ calculation
  • C-table usage
  • backward-search implementation in query_fm_v1 / query_fm_server_v1

SA interval truth#

512MB HDFS prefix:

Pattern:

  • blk_-1000095285706020638

SA scan result:

  • SA match count: 17
  • correct SA interval: [359783091, 359783108)
  • matches are contiguous: true

FM result:

  • FM interval: [359783106, 359783109)
  • FM count: 3

Interpretation:

The FM query path returns an interval shifted inside/near the correct suffix-array interval. SA and BWT are correct. The fault is in FM backward-search support data or occ/C logic.

Root cause confirmed#

The FM undercount is caused by missing real appended sentinel in the indexed corpus.

Previous behavior:

  • corpus did not contain 0x00
  • build_bwt emitted synthetic sentinel byte for SA=0
  • FM C-table was built from BWT containing this synthetic sentinel
  • corpus alphabet and BWT/FM alphabet were inconsistent
  • backward search intervals shifted and undercounted real matches

Sentinel fix test:

  • input: 512MB HDFS prefix + appended 0x00
  • Python bytes.count: 17
  • FM count: 17
  • FM interval: [359783092, 359783109)

Conclusion:

FM correctness requires indexing corpus + real unique sentinel.

For v0.x, safe mode is:

  • require corpus does not contain 0x00
  • append 0x00 before SA/BWT/FM build
  • exclude sentinel from user query surface

Long-term raw-byte solution:

  • implement 257-symbol alphabet or explicit sentinel handling independent of byte values.

Status update#

Previous status:

  • BLOCKER for segmented correctness validation

Current status:

  • ROOT CAUSE FOUND
  • FIX PATH IMPLEMENTED

Implemented safeguards:

  • tools/prepare_sentinel_corpus_v1.py
  • tools/build_glyph_index_v1.sh

Canonical invariant:

GLYPH FM-index v0.x must index:

corpus + real appended 0x00 sentinel

The following flow is now required:

raw corpus
-> prepare_sentinel_corpus_v1.py
-> build_sa_u32
-> build_bwt
-> build_fm

Validation result:

  • problematic HDFS pattern now returns exact FM counts
  • segmented retrieval correctness investigation may continue

Remaining long-term limitation:

v0.x still assumes:

  • corpus must not contain 0x00

Future robust solution:

  • 257-symbol alphabet or explicit out-of-band sentinel representation