Open
Description
Summary
The resume path of rooch db state-prune snapshot can drop child nodes when a run is interrupted, causing the final integrity check to fail with a missing child node.
Impact
Snapshots built via resume may be unusable; integrity check fails even when source DB is healthy.
Root causes
- Progress is persisted every 5 minutes; newly enqueued children can be lost if the process dies before save.
- Resume trusts snapshot_progress.json for worklist and nodes_written without reconciling with snapshot.db contents.
- nodes_written restored from file masks missing nodes; crash after pushing children but before write can leave parent present and child absent.
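The failure mode above can be reproduced in a toy model. This is only a sketch under stated assumptions, not the actual rooch traversal: the `Tree`, `Progress`, and `run` names are hypothetical, an in-memory set stands in for snapshot.db, and the model assumes resume skips nodes that are already in the snapshot without re-expanding their children.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

type NodeId = u32;
/// Toy tree: node id -> child ids. Stands in for the source state DB.
type Tree = BTreeMap<NodeId, Vec<NodeId>>;

/// Hypothetical traversal state, loosely mirroring snapshot_progress.json.
#[derive(Clone, Debug)]
struct Progress {
    worklist: VecDeque<NodeId>,
    nodes_written: u64,
}

/// Process up to `steps` nodes: write each node to `snapshot`, then enqueue its children.
/// ASSUMPTION: a node already in `snapshot` is skipped and its children are not re-enqueued.
fn run(tree: &Tree, snapshot: &mut BTreeSet<NodeId>, p: &mut Progress, steps: usize) {
    for _ in 0..steps {
        let Some(id) = p.worklist.pop_front() else { break };
        if !snapshot.insert(id) {
            continue; // already written on a previous run
        }
        p.nodes_written += 1;
        if let Some(children) = tree.get(&id) {
            p.worklist.extend(children.iter().copied());
        }
    }
}

/// (parent, child) pairs where the parent is in `snapshot` but the child is not.
fn missing_children(tree: &Tree, snapshot: &BTreeSet<NodeId>) -> Vec<(NodeId, NodeId)> {
    let mut out = Vec::new();
    for &parent in snapshot {
        for &child in tree.get(&parent).into_iter().flatten() {
            if !snapshot.contains(&child) {
                out.push((parent, child));
            }
        }
    }
    out
}

/// Interrupt a run between two progress saves, then resume from the stale save.
fn simulate_interrupted_resume() -> Vec<(NodeId, NodeId)> {
    let tree: Tree =
        BTreeMap::from([(1, vec![2, 3]), (2, vec![4]), (3, vec![]), (4, vec![])]);
    let mut snapshot = BTreeSet::new();
    let mut live = Progress { worklist: VecDeque::from([1]), nodes_written: 0 };

    run(&tree, &mut snapshot, &mut live, 1); // writes node 1, enqueues 2 and 3
    let saved = live.clone(); // periodic progress save
    run(&tree, &mut snapshot, &mut live, 1); // writes node 2, enqueues 4
    // Crash here: `live` (which holds node 4 in its worklist) is lost.

    let mut resumed = saved; // resume trusts the stale progress file
    run(&tree, &mut snapshot, &mut resumed, usize::MAX);
    missing_children(&tree, &snapshot)
}
```

In this model, node 2 was written after the last save, so the resumed run skips it and node 4 is never enqueued. The restored `nodes_written` also undercounts the snapshot contents.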
Repro (high level)
- Run rooch db state-prune snapshot (default resume enabled).
- Interrupt between progress saves (e.g., kill process after some batches).
- Resume; run completes but final integrity check reports missing child node.
Proposed fix (MVP)
- On resume, recompute nodes_written from snapshot.db (actual count) and prefer DB over progress file; warn on divergence.
- Make frontier durable: persist worklist/batch_buffer much more frequently (seconds) or log transactionally before batch writes.
- Safe resume: optionally rebuild worklist by scanning snapshot.db from root (enqueue parents with missing children) or force restart when progress is stale.
- Progress hygiene: if progress file is older/shorter than DB, delete or ignore to avoid partial frontier.
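One way the reconcile-and-rebuild steps above might look, as a minimal sketch. It assumes an in-memory map standing in for snapshot.db in which each stored node lists its child ids; `SnapshotDb`, `ResumeState`, and `reconcile` are hypothetical names and not part of the rooch codebase.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

type NodeId = u32;
/// Stand-in for snapshot.db: node id -> child ids recorded in that node.
type SnapshotDb = BTreeMap<NodeId, Vec<NodeId>>;

/// Hypothetical state handed to the resumed traversal.
#[derive(Debug, PartialEq)]
struct ResumeState {
    nodes_written: u64,
    worklist: VecDeque<NodeId>,
    diverged: bool,
}

/// Recompute nodes_written from the DB and rebuild the worklist by walking from `root`.
/// Parents that reference an absent child are enqueued for reprocessing.
fn reconcile(db: &SnapshotDb, root: NodeId, progress_nodes_written: u64) -> ResumeState {
    // Prefer the actual DB count over the progress file; warn on divergence.
    let actual = db.len() as u64;
    let diverged = actual != progress_nodes_written;
    if diverged {
        eprintln!(
            "warning: progress file says {progress_nodes_written} nodes, snapshot has {actual}"
        );
    }

    let mut worklist = VecDeque::new();
    let mut seen = BTreeSet::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        if !seen.insert(id) {
            continue;
        }
        let Some(children) = db.get(&id) else {
            // Only the root can reach this branch: nothing was written yet.
            worklist.push_back(id);
            continue;
        };
        let mut incomplete = false;
        for &child in children {
            if db.contains_key(&child) {
                stack.push(child);
            } else {
                incomplete = true;
            }
        }
        if incomplete {
            worklist.push_back(id);
        }
    }

    ResumeState { nodes_written: actual, worklist, diverged }
}
```

A full scan costs one pass over the snapshot on every resume. In exchange, the frontier no longer depends on how fresh the progress file is.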
Acceptance
- Kill-and-resume cycles no longer produce missing-child errors.
- Integrity check passes after resumed runs; logged node count matches RocksDB actual.
- --no-resume behavior unchanged.
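The first two acceptance criteria could be checked after each kill-and-resume cycle with something like the sketch below. It is illustrative only: `SnapshotDb`, `CheckError`, and `verify_snapshot` are hypothetical names, an in-memory map stands in for the RocksDB snapshot, and the count check assumes every stored node is reachable from the root.

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u32;
/// Stand-in for the RocksDB snapshot: node id -> child ids.
type SnapshotDb = BTreeMap<NodeId, Vec<NodeId>>;

#[derive(Debug, PartialEq)]
enum CheckError {
    MissingRoot(NodeId),
    MissingChild { parent: NodeId, child: NodeId },
    CountMismatch { logged: u64, actual: u64 },
}

/// Walk the snapshot from `root`, fail on the first absent child,
/// then compare the logged node count with the actual number of stored nodes.
fn verify_snapshot(db: &SnapshotDb, root: NodeId, logged: u64) -> Result<(), CheckError> {
    if !db.contains_key(&root) {
        return Err(CheckError::MissingRoot(root));
    }
    let mut seen = BTreeSet::new();
    let mut stack = vec![root];
    while let Some(parent) = stack.pop() {
        if !seen.insert(parent) {
            continue;
        }
        // `parent` is always present: only nodes found in `db` are pushed.
        for &child in &db[&parent] {
            if !db.contains_key(&child) {
                return Err(CheckError::MissingChild { parent, child });
            }
            stack.push(child);
        }
    }
    let actual = db.len() as u64;
    if logged != actual {
        return Err(CheckError::CountMismatch { logged, actual });
    }
    Ok(())
}
```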