Skip to content

Comments

fix(reestablish): stabilize commitment replay across node restarts#1111

Open
jjyr wants to merge 14 commits intonervosnetwork:developfrom
jjyr:repro-restart-upstream-develop
Open

fix(reestablish): stabilize commitment replay across node restarts#1111
jjyr wants to merge 14 commits intonervosnetwork:developfrom
jjyr:repro-restart-upstream-develop

Conversation

@jjyr
Copy link
Collaborator

@jjyr jjyr commented Feb 7, 2026

Summary

This PR stabilizes the channel reestablish/replay path after node restart and removes a flaky Musig2VerifyError(BadSignature) class of failures in restart stress scenarios.

The branch is rebased from upstream/develop and intentionally keeps docs unchanged from upstream.

Problem

On restart-heavy scenarios (bidirectional in-flight TLC updates + reconnect/reestablish), replay could become inconsistent and lead to:

  • Musig2VerifyError(BadSignature) on CommitmentSigned verification
  • test-level panics caused by unexpected error events

In particular, the newly added ignored reproducer (test_node_restart) could fail on clean upstream baseline before fixes.

What changed

1) Persist and use CommitDiff for deterministic replay

  • Persist pending commit replay context in store
  • Add/extend commit-diff replay data used across restart/reestablish
  • Add store trait + implementation support:
    • store_pending_commit_diff
    • get_pending_commit_diff
    • delete_pending_commit_diff

2) Enforce strict replay validation

  • Reestablish replay now requires a valid pending CommitDiff when commitment replay is owed
  • Validate replay preconditions (channel/commitment alignment) before replaying

3) Preserve replay order and wire-order updates

  • Keep replay TLC updates in wire order
  • Persist and use replay ordering hints for dual-owed (RevokeAndAck + CommitmentSigned) cases
  • Respect last_was_revoke in replay ordering decisions

4) Final restart-path stabilization

  • During dual-owed replay, defer peer TLC updates and flush them at safe points
  • Reuse stored funding partial signature for replay when available
  • Deduplicate retryable TLC remove operations in network actor restart path

5) Tests

  • Add restart/reestablish regression tests
  • Rename stress test to clearer naming (test_restart_stress_multiple_restarts)
  • Add ignored long-running reproducer (test_node_restart)
  • Update affected test mocks/fixtures for new replay/store interfaces

Verification

Executed on this branch:

cargo nextest run -p fnn test_resolve_dual_owed_replay_order test_restart_stress_multiple_restarts --no-fail-fast
cargo nextest run -p fnn test_node_restart --run-ignored ignored-only --success-output final --no-fail-fast

Results:

  • Both commands passed
  • No BadSignature / Musig2VerifyError / panic markers in the final ignored restart run

Notes

  • This PR is intentionally focused on restart/reestablish replay stability and reproducibility.
  • No documentation files are changed in this PR.

@jjyr jjyr requested review from chenyukang, doitian and quake February 7, 2026 08:55
@jjyr jjyr marked this pull request as ready for review February 7, 2026 09:12
@quake quake requested a review from Copilot February 10, 2026 03:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves channel reestablish/replay determinism across node restarts by persisting a CommitDiff replay context, enforcing stricter replay validation, and stabilizing replay ordering (including dual-owed cases) to eliminate flaky Musig2VerifyError(BadSignature) failures in restart-heavy scenarios.

Changes:

  • Persist pending CommitDiff in the store and use it to replay CommitmentSigned deterministically after restart/reestablish.
  • Enforce stricter replay precondition validation and add ordering hints + temporary deferral/flush of peer TLC updates to preserve wire-order during sensitive replay windows.
  • Add/adjust tests for commit-diff serialization/validation and restart/reestablish regression coverage.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
crates/fiber-lib/src/store/tests/store.rs Updates store serialization tests to include new ChannelActorState fields.
crates/fiber-lib/src/store/store_impl/mod.rs Implements ChannelActorStateStore persistence APIs for pending CommitDiff.
crates/fiber-lib/src/store/schema.rs Adds a new key prefix for persisting pending CommitDiff.
crates/fiber-lib/src/fiber/tests/settle_tlc_set_command_tests.rs Extends the mock store to satisfy the updated store trait; updates test state construction for new fields.
crates/fiber-lib/src/fiber/tests/mod.rs Registers a new test module for commit-diff behavior.
crates/fiber-lib/src/fiber/tests/channel_commit_diff.rs Adds unit tests for CommitDiff roundtrip, replay validation, and dual-owed replay ordering.
crates/fiber-lib/src/fiber/tests/channel.rs Adds restart/reestablish regression tests and an ignored long-running restart reproducer.
crates/fiber-lib/src/fiber/serde_utils.rs Adds serde helpers for musig2 PartialSignature byte serialization.
crates/fiber-lib/src/fiber/network.rs Deduplicates queued retryable TLC remove operations on network actor restart paths.
crates/fiber-lib/src/fiber/channel.rs Core replay stabilization: persist CommitDiff, validate replay inputs, preserve wire order, implement deferred peer TLC update handling, and add replay ordering hint logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 21 to 29
pub(crate) const CHANNEL_ACTOR_STATE_PREFIX: u8 = 0;
pub(crate) const PEER_ID_NETWORK_ACTOR_STATE_PREFIX: u8 = 16;
pub(crate) const CKB_INVOICE_PREFIX: u8 = 32;
pub(crate) const PREIMAGE_PREFIX: u8 = 33;
pub(crate) const CKB_INVOICE_STATUS_PREFIX: u8 = 34;
pub(crate) const PENDING_COMMIT_DIFF_PREFIX: u8 = 48;
pub(crate) const PEER_ID_CHANNEL_ID_PREFIX: u8 = 64;
pub(crate) const CHANNEL_OUTPOINT_CHANNEL_ID_PREFIX: u8 = 65;
pub(crate) const BROADCAST_MESSAGE_PREFIX: u8 = 96;
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema header/table comment at the top of this file documents the prefix map, but it isn’t updated to include the newly introduced PENDING_COMMIT_DIFF_PREFIX (48). Please add an entry for prefix 48 (key/value shape) so the in-file schema documentation stays accurate.

Copilot uses AI. Check for mistakes.
Comment on lines +1658 to 1662
created_at_ms: now_timestamp_as_millis_u64(),
};
self.store
.store_pending_commit_diff(&state.get_id(), &commit_diff);
// Notify outside observers.
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CommitDiff is persisted here for replay, but I couldn’t find any call sites that delete it once the pending CommitmentSigned is acknowledged (e.g., after receiving RevokeAndAck / clearing waiting_ack) or when the channel is closed. Consider deleting the pending diff at the point the commit is fully acked to avoid keeping stale replay state around indefinitely.

Copilot uses AI. Check for mistakes.
Comment on lines 225 to 231
let bytes: Vec<u8> = Deserialize::deserialize(deserializer)?;
if bytes.len() != 32 {
return Err(serde::de::Error::custom(
"expected 32 bytes for PartialSignature",
));
}
PartialSignature::from_slice(&bytes).map_err(serde::de::Error::custom)
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartialSignature length is validated with a hard-coded 32. If musig2 exposes a canonical size constant (similar to SCHNORR_SIGNATURE_SIZE), it would be better to use that here to avoid magic numbers and to stay correct if the underlying type changes.

Copilot uses AI. Check for mistakes.
jjyr and others added 13 commits February 12, 2026 13:35
Channels created before CommitDiff was introduced have no stored diff,
causing reestablishment to fail. Fall back to the old develop-branch
behavior (resend_tlcs_on_reestablish) when no CommitDiff is present.
Also remove dead last_was_revoke assignment in CommitThenRevoke arm.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hout CommitDiff

Add two integration tests exercising the legacy reestablish fallback
when no CommitDiff is stored (simulating pre-CommitDiff channels):
- test_legacy_fallback_dual_owed_no_commit_diff (Path B1)
- test_legacy_fallback_single_owed_no_commit_diff (Path C)

Both delete CommitDiff from store before restart to force the legacy
resend_tlcs_on_reestablish path, then verify reestablish completes
with symmetric commitment numbers and functional channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jjyr jjyr force-pushed the repro-restart-upstream-develop branch from 20f3ba0 to 94122b5 Compare February 12, 2026 05:41
The reestablish commits added new fields (pending_replay_updates,
defer_peer_tlc_updates, deferred_peer_tlc_updates, last_was_revoke)
to ChannelActorState but did not update the sample constructors,
causing CI compilation failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant