GLYPH FM Artifact Checksum Design
Status: design locked, implementation pending.
Problem#
Current fm.bin uses an explicit magic string:
FMBINv1\0
The current FM artifact has a versioned magic/header, but no checksum over the checkpoint payload.
Current corruption detection path:
- manifest.json links corpus and artifacts
- FM binary is checked for existence
- FM magic is checked at load
- FM checkpoint data is not checksummed internally
A corrupted checkpoint array could pass current header checks.
Current FMBINv1 layout#
Current layout written by src/build_fm.cpp:
offset size field
--------------------------------
0 8 magic = "FMBINv1\0"
8 8 n (uint64)
16 4 checkpoint_step (uint32)
20 8 num_blocks (uint64)
28 2048 C[256] uint64 table
2076 ... checkpoints[num_blocks][256] uint32
Version is encoded in the magic string.
There is no separate version field.
Why FMBINv2, not a patch to v1#
Appending a checksum to FMBINv1 would be ambiguous:
- old readers may ignore trailing bytes
- readers cannot distinguish plain v1 from "v1 plus checksum"
- the same magic would describe two layouts
Decision:
introduce FMBINv2
New readers should reject v1 when v2 is required. Old readers should reject v2 via magic mismatch.
No silent fallback.
Checksum algorithm#
Candidates:
- FNV-1a: simple, no dependency, slower
- CRC32C: fast with hardware support, hardware-dependent
- xxHash64: fast software hash, widely used, 64-bit output
Initial implementation decision:
FNV-1a 64-bit
Reason:
- no external dependency
- tiny implementation
- stable across platforms
- sufficient for corruption detection
- low implementation risk
Future optimization:
xxHash64 may replace FNV-1a in a later format version
if checksum time becomes measurable.
This checksum is not a security mechanism. It is an artifact corruption detector.
What the checksum covers#
Decision:
checkpoints payload only
Covered bytes:
checkpoints[num_blocks][256] uint32
Not covered:
- magic
- n
- checkpoint_step
- num_blocks
- C[256]
Reason:
The checkpoint payload is the large corruption-prone data region. Header fields are already parsed and validated structurally.
Future formats may choose to checksum the full post-magic payload, but v2 keeps the scope minimal and explicit.
Proposed FMBINv2 layout#
offset size field
--------------------------------
0 8 magic = "FMBINv2\0"
8 8 n (uint64)
16 4 checkpoint_step (uint32)
20 8 num_blocks (uint64)
28 2048 C[256] uint64 table
2076 8 checkpoint_payload_bytes (uint64)
2084 8 checkpoint_xxhash64 (uint64)
2092 ... checkpoints[num_blocks][256] uint32
Expected:
checkpoint_payload_bytes == num_blocks * 256 * sizeof(uint32)
Readers must validate:
- magic == FMBINv2\0
- checkpoint_step > 0
- num_blocks > 0
- checkpoint_payload_bytes matches expected size
- file contains exactly the expected checkpoint payload
- xxHash64(checkpoints) matches checkpoint_xxhash64
Backward compatibility#
FMBINv1:
- current format
- no internal checksum
- still documented
FMBINv2:
- checksum-bearing format
- required for future strict artifact verification
Migration:
- rebuild FM artifacts with updated build_fm
- keep old v1 docs for historical reference
- do not silently accept v1 in strict mode
Implementation plan#
- Vendor or implement xxHash64 without external build dependency.
- Update
src/build_fm.cpp:- write magic
FMBINv2\0 - compute xxHash64 over checkpoint payload
- write
checkpoint_payload_bytes - write
checkpoint_xxhash64
- write magic
- Update FM readers:
src/query_fm_v1.cppsrc/query_fm_server_v1.cppsrc/query_fm_batch_v1.cpp
- Add tests:
- valid FMBINv2 loads correctly
- bad magic rejected
- truncated payload rejected
- corrupted checkpoint byte rejected
- payload byte count mismatch rejected
- Rebuild generated FM artifacts:
- examples/mini
- bench_1gb when needed
Relationship to Query Tiers#
Tier 1:
may use trusted already-loaded FMBINv2
Tier 2:
uses manifest-level verification
Tier 3:
requires internal artifact checksum validation
FMBINv2 is a prerequisite for Tier 3.