fix(reestablish): stabilize commitment replay across node restarts#1111
fix(reestablish): stabilize commitment replay across node restarts#1111jjyr wants to merge 14 commits intonervosnetwork:developfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves channel reestablish/replay determinism across node restarts by persisting a CommitDiff replay context, enforcing stricter replay validation, and stabilizing replay ordering (including dual-owed cases) to eliminate flaky Musig2VerifyError(BadSignature) failures in restart-heavy scenarios.
Changes:
- Persist pending
CommitDiffin the store and use it to replayCommitmentSigneddeterministically after restart/reestablish. - Enforce stricter replay precondition validation and add ordering hints + temporary deferral/flush of peer TLC updates to preserve wire-order during sensitive replay windows.
- Add/adjust tests for commit-diff serialization/validation and restart/reestablish regression coverage.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/fiber-lib/src/store/tests/store.rs | Updates store serialization tests to include new ChannelActorState fields. |
| crates/fiber-lib/src/store/store_impl/mod.rs | Implements ChannelActorStateStore persistence APIs for pending CommitDiff. |
| crates/fiber-lib/src/store/schema.rs | Adds a new key prefix for persisting pending CommitDiff. |
| crates/fiber-lib/src/fiber/tests/settle_tlc_set_command_tests.rs | Extends the mock store to satisfy the updated store trait; updates test state construction for new fields. |
| crates/fiber-lib/src/fiber/tests/mod.rs | Registers a new test module for commit-diff behavior. |
| crates/fiber-lib/src/fiber/tests/channel_commit_diff.rs | Adds unit tests for CommitDiff roundtrip, replay validation, and dual-owed replay ordering. |
| crates/fiber-lib/src/fiber/tests/channel.rs | Adds restart/reestablish regression tests and an ignored long-running restart reproducer. |
| crates/fiber-lib/src/fiber/serde_utils.rs | Adds serde helpers for musig2 PartialSignature byte serialization. |
| crates/fiber-lib/src/fiber/network.rs | Deduplicates queued retryable TLC remove operations on network actor restart paths. |
| crates/fiber-lib/src/fiber/channel.rs | Core replay stabilization: persist CommitDiff, validate replay inputs, preserve wire order, implement deferred peer TLC update handling, and add replay ordering hint logic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub(crate) const CHANNEL_ACTOR_STATE_PREFIX: u8 = 0; | ||
| pub(crate) const PEER_ID_NETWORK_ACTOR_STATE_PREFIX: u8 = 16; | ||
| pub(crate) const CKB_INVOICE_PREFIX: u8 = 32; | ||
| pub(crate) const PREIMAGE_PREFIX: u8 = 33; | ||
| pub(crate) const CKB_INVOICE_STATUS_PREFIX: u8 = 34; | ||
| pub(crate) const PENDING_COMMIT_DIFF_PREFIX: u8 = 48; | ||
| pub(crate) const PEER_ID_CHANNEL_ID_PREFIX: u8 = 64; | ||
| pub(crate) const CHANNEL_OUTPOINT_CHANNEL_ID_PREFIX: u8 = 65; | ||
| pub(crate) const BROADCAST_MESSAGE_PREFIX: u8 = 96; |
There was a problem hiding this comment.
The schema header/table comment at the top of this file documents the prefix map, but it isn’t updated to include the newly introduced PENDING_COMMIT_DIFF_PREFIX (48). Please add an entry for prefix 48 (key/value shape) so the in-file schema documentation stays accurate.
| created_at_ms: now_timestamp_as_millis_u64(), | ||
| }; | ||
| self.store | ||
| .store_pending_commit_diff(&state.get_id(), &commit_diff); | ||
| // Notify outside observers. |
There was a problem hiding this comment.
CommitDiff is persisted here for replay, but I couldn’t find any call sites that delete it once the pending CommitmentSigned is acknowledged (e.g., after receiving RevokeAndAck / clearing waiting_ack) or when the channel is closed. Consider deleting the pending diff at the point the commit is fully acked to avoid keeping stale replay state around indefinitely.
| let bytes: Vec<u8> = Deserialize::deserialize(deserializer)?; | ||
| if bytes.len() != 32 { | ||
| return Err(serde::de::Error::custom( | ||
| "expected 32 bytes for PartialSignature", | ||
| )); | ||
| } | ||
| PartialSignature::from_slice(&bytes).map_err(serde::de::Error::custom) |
There was a problem hiding this comment.
PartialSignature length is validated with a hard-coded 32. If musig2 exposes a canonical size constant (similar to SCHNORR_SIGNATURE_SIZE), it would be better to use that here to avoid magic numbers and to stay correct if the underlying type changes.
Channels created before CommitDiff was introduced have no stored diff, causing reestablishment to fail. Fall back to the old develop-branch behavior (resend_tlcs_on_reestablish) when no CommitDiff is present. Also remove dead last_was_revoke assignment in CommitThenRevoke arm. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hout CommitDiff Add two integration tests exercising the legacy reestablish fallback when no CommitDiff is stored (simulating pre-CommitDiff channels): - test_legacy_fallback_dual_owed_no_commit_diff (Path B1) - test_legacy_fallback_single_owed_no_commit_diff (Path C) Both delete CommitDiff from store before restart to force the legacy resend_tlcs_on_reestablish path, then verify reestablish completes with symmetric commitment numbers and functional channel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
20f3ba0 to
94122b5
Compare
The reestablish commits added new fields (pending_replay_updates, defer_peer_tlc_updates, deferred_peer_tlc_updates, last_was_revoke) to ChannelActorState but did not update the sample constructors, causing CI compilation failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR stabilizes the channel reestablish/replay path after node restart and removes a flaky
Musig2VerifyError(BadSignature)class of failures in restart stress scenarios.The branch is rebased from
upstream/developand intentionally keeps docs unchanged from upstream.Problem
On restart-heavy scenarios (bidirectional in-flight TLC updates + reconnect/reestablish), replay could become inconsistent and lead to:
Musig2VerifyError(BadSignature)onCommitmentSignedverificationIn particular, the newly added ignored reproducer (
test_node_restart) could fail on clean upstream baseline before fixes.What changed
1) Persist and use
CommitDifffor deterministic replaystore_pending_commit_diffget_pending_commit_diffdelete_pending_commit_diff2) Enforce strict replay validation
CommitDiffwhen commitment replay is owed3) Preserve replay order and wire-order updates
RevokeAndAck+CommitmentSigned) caseslast_was_revokein replay ordering decisions4) Final restart-path stabilization
5) Tests
test_restart_stress_multiple_restarts)test_node_restart)Verification
Executed on this branch:
Results:
BadSignature/Musig2VerifyError/ panic markers in the final ignored restart runNotes