fix: prevent handshake race condition causing ConnectionClosed failures#644
Conversation
33e1d3d to
66a2b48
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR fixes handshake race conditions in the SSV network implementation by deferring handshake initiation until after the Identify protocol confirms connection stability and protocol support.
Key Changes:
- Queue outbound handshakes instead of initiating immediately on connection establishment
- Wait for Identify protocol completion before starting handshakes to ensure stable connections
- Clean up pending handshakes on connection errors and closures
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| anchor/network/src/peer_manager/discovery.rs | Changed dial condition from DisconnectedAndNotDialing to NotDialing to reduce noise |
| anchor/network/src/network.rs | Added pending handshake queue, deferred initiation logic, and enhanced connection event handling |
| anchor/network/src/handshake/mod.rs | Added documentation and trace logging for handshake failure debugging |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
d6194c9 to
eac0a0e
Compare
AgeManning
left a comment
There was a problem hiding this comment.
I'm not sure I'm following why this is necessary.
Firstly, Identify at least in my mind is mainly used for debugging. In Lighthouse its more or less optional, I know ssv doesn't really have network specs, but I'm not sure if identify is a necessary required protocol.
I think all libp2p behaviours should work independently of each other. It allows behaviours to be composable. I think by tying the Handshake to Identify you are effectively saying the handshake protocol cannot function without identify. Which I dont think should be the case.
In regards to the actual issue, which seems to be concurrent handshake requests due to concurrent connections. I would have expected that when we dial, we dial a few multiaddrs (if multiple transports are supported). We do this in a single dial, via the dialopts. We then get libp2p to decide which transport we prioritize. I'm not sure why we would be calling dial again for any reason on an already connected peer. We should ensure we don't re-dial connected peers.
When we open a stream, libp2p uses multistream select to determine which protocols are supported. So we can have a connection with a peer, and the handshake can fail if they dont support it. I don't think this is an issue, if the negotation fails, we just drop the peer. We don't need identify to determine this in advance I dont think.
Would it be sufficient just to check num_established in connection_established and the connection direction before sending the handshake?. I.e if num_established = 1 and direction is outbound, then we do the handshake. This would simplify the changes and not rely on identify. Or is there something else I'm missing?
2208363 to
c141a85
Compare
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
3c4376a to
166809c
Compare
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
8be966d to
b7c427c
Compare
Fixes SSV handshake failures caused by opening protocol streams on connections that get closed due to concurrent dials or connection trimming. Observed as OutboundFailure::ConnectionClosed errors. The fix implements a pending handshake queue that defers handshake initiation until after the Identify protocol confirms the connection is stable and supports the SSV handshake protocol. Key changes: - Add pending_handshakes queue to defer handshake initiation - Wait for Identify protocol before initiating handshakes - Change dial policy from DisconnectedAndNotDialing to NotDialing - Add 30-second idle connection timeout for handshake completion - Improve logging to differentiate expected vs unexpected dial failures - Add comprehensive documentation explaining the race condition
Convert handshake behaviour from type alias to self-contained struct that automatically initiates handshakes on first outbound connections. This eliminates the need for external coordination and simplifies the architecture. Key Changes: - Replace type alias with proper Behaviour struct wrapping RequestResponseBehaviour - Implement NetworkBehaviour trait with auto-initiation in on_swarm_event - Store NodeInfo in the behaviour for automatic handshake initiation - Use `other_established == 0` check to prevent duplicate initiations during concurrent dials - Change create_behaviour() to Behaviour::new() (idiomatic Rust constructor) - Move initiate() to be a method on Behaviour struct - Remove manual handshake initiation from network.rs - Add concurrent_dials_both_initiate_handshake test proving the fix works - Add helper functions for cleaner test structure Evidence: The concurrent dials test proves that other_established == 0 prevents duplicate handshake initiations. Without the check, both peers initiate when they get Dialer connections (duplicate). With the check, only the first peer initiates. This makes handshakes fully composable without requiring Identify protocol coordination or external state management.
Convert standalone handle_event function to a method on Behaviour for better encapsulation and more idiomatic Rust. Extract test event handling logic into helper function to eliminate duplication and improve readability. API Changes: - handle_event() is now a method: behaviour.handle_event(event) - handle_request() and handle_response() moved into Behaviour impl as private methods - Removed unused initiate() method (handshakes are auto-initiated) - Simplified call sites in network.rs and tests Test Improvements: - Added handle_swarm_event_for_test() helper to eliminate 40+ lines of duplication - Tests are now much more readable and maintainable - All assertions and tracking logic centralized in one place Cleanup: - Removed .with_bandwidth_metrics() call (doesn't exist in current libp2p version) - Removed unused parameters from test helpers
|
@diegomrsantos: As Joao noted here #660 the second log line is not necessarily associated with the other two. I've checked the recent logs, and can not really find instances of this issue anymore. I want to avoid making significant changes to the handshake if they turn out to be motivated by an issue that does not exist. Nevertheless, during the evolution of this PR it seems to have become a reasonable refactor. However, some changes seem to be unrelated, such as removal of logging. Also, some of the comments seem to still reference the original approach of using the identity protocol. Can you please clean it up a bit so that it is easier to review and more focused on the refactor? |
if |
|
Claude finished @diegomrsantos's task —— View job Code Review CompleteI've conducted a comprehensive review of this PR focusing on code quality, security, performance, testing, and documentation. Here are my findings: SummaryThis PR refactors the handshake behavior to be self-contained and auto-initiating, addressing a potential race condition in concurrent connection scenarios. The implementation follows idiomatic rust-libp2p patterns and includes significant test improvements. Key Change: The handshake behavior now automatically initiates handshakes on the first outbound connection using the Critical Issues1. Race Condition Fix Necessity (Severity: Discussion Needed)Location: Throughout the PR As reviewers @AgeManning and @dknopik noted, the original race condition may no longer exist in production. The PR description shows logs that may not be causally related. However, the author demonstrates in Recommendation:
2. Breaking Change in Public API (Severity: High)Location: The PR fundamentally changes the public API:
// Old API (removed)
pub type Behaviour = RequestResponseBehaviour<Codec>;
pub type Event = <Behaviour as NetworkBehaviour>::ToSwarm;
pub fn handle_event(...) -> Option<Result<Completed, Failed>>
// New API
pub struct Behaviour { ... }
pub enum Event { Completed {...}, Failed {...} }Impact: This is a breaking change for any external code using the handshake module. Recommendation:
3. Error Handling: Silent Failure (Severity: Medium)Location: let _ = self.inner.send_response(channel, self.node_info.clone());The response sending can fail, but errors are silently discarded. While this may be acceptable for handshake responses (the peer will timeout), it makes debugging difficult. Recommendation: if let Err(e) = self.inner.send_response(channel, self.node_info.clone()) {
trace!(?peer_id, ?e, "Failed to send handshake response");
}4. Potential Panic in Production Code (Severity: High)Location: other => {
return Poll::Ready(
other.map_out(|_| unreachable!("GenerateEvent already handled")),
);
}While the comment explains this should be unreachable, using Recommendation: Consider a more defensive approach: other => {
trace!("Bubbling up non-GenerateEvent ToSwarm event");
return Poll::Ready(
other.map_out(|_| {
// This closure should never be called because we've handled all GenerateEvent variants.
// If it does get called, it indicates a programming error or unexpected libp2p behavior.
panic!("Unexpected: map_out closure called on non-GenerateEvent. This is a bug.")
}),
);
}Or use a custom error type instead of panicking. Code Quality Issues5. Let Chains Syntax (Severity: Info)Location: if let libp2p::swarm::FromSwarm::ConnectionEstablished(conn_est) = event
&& let libp2p::core::ConnectedPoint::Dialer { .. } = conn_est.endpoint
&& conn_est.other_established == 0Let chains require Rust 1.64+. While likely fine for this project, consider documenting minimum Rust version in 6. Clone Performance (Severity: Low)Location: Multiple locations in self.inner.send_request(peer_id, self.node_info.clone());
Analysis:
Recommendation: Profile first, optimize only if this shows up as a bottleneck. 7. Test Quality Improvement SuggestionLocation: The concurrent dials test verifies no duplicate handshakes by expecting a timeout on the second Recommendation: // Instead of timing out, explicitly check the event queue is empty
// Or use a shorter timeout (10-50ms) with a clear comment about why
let result = timeout(Duration::from_millis(50), async {
// This should timeout - no more events should occur
let ([_local], [_remote]): ([Event; 1], [Event; 1]) =
drive(&mut local_swarm, &mut remote_swarm).await;
}).await;
assert!(result.is_err(),
"Expected no more handshake events (timeout), but got additional events. \
This indicates duplicate handshakes occurred, violating the other_established==0 check.");Security Considerations8. Network Mismatch Handling (Severity: Info)Location: The network mismatch detection is good. The code properly rejects connections from peers on different networks. No issues found. 9. DoS Vector: Resource Exhaustion (Severity: Low)Location: events: VecDeque::new(),Events are buffered in a Analysis:
Recommendation: Document the assumption that connection limits prevent unbounded growth, or add a maximum queue size. Performance Considerations10. Idle Connection Timeout (Severity: Info)Location: .with_idle_connection_timeout(Duration::from_secs(30));Good improvement. The 30-second timeout provides time for handshake completion while still cleaning up idle connections. The comment explains the rationale well. 11. Dial Condition Change (Severity: Important)Location: - .condition(PeerCondition::NotDialing)
+ .condition(PeerCondition::DisconnectedAndNotDialing)This is a significant behavior change:
Impact: This prevents creating duplicate connections from concurrent dials, which complements the Recommendation: This change should be highlighted in the PR description as it's a critical part of preventing the race condition. Testing12. Test Improvements (Severity: Positive)Location: Excellent test improvements:
Positive feedback: The test quality is significantly improved. 13. Test Coverage Gap (Severity: Medium)Location: Test suite Missing test cases:
Recommendation: Consider adding these edge case tests, especially for rapid reconnection scenarios. Documentation14. Documentation Quality (Severity: Positive)Location: Good documentation:
15. PR Description Clarity (Severity: Info)The PR description is excellent:
Minor suggestion: Add a migration guide for any downstream code using the old Architecture Assessment16. Self-Contained Behavior Pattern (Severity: Positive)The refactor to make the handshake behavior self-contained is architecturally sound:
This aligns with how Lighthouse and other production rust-libp2p implementations structure their code. 17. Event Processing in
|
Remove redundant Completed and Failed structs that duplicated Event variant data. Pass Event directly to handlers instead of wrapping in Result. This eliminates unnecessary type conversions and follows idiomatic libp2p patterns for NetworkBehaviour events.
Re-add structured logging for connection errors that was removed in a previous refactoring. These errors are distinct from routine connection events: - OutgoingConnectionError: Failed to establish outgoing connection - IncomingConnectionError: Failed to accept incoming connection - ConnectionClosed with cause: Established connection failed at runtime Log errors at debug level and clean closures at trace level following libp2p's own logging patterns.
Replace silent failure handling with trace-level logging when handshake response cannot be sent. This occurs when the response channel is already closed (peer disconnected), which is not an error condition but useful for debugging handshake flows. Uses trace level as this is an expected race condition that doesn't require operator attention.
Issue Addressed
Fixes SSV handshake failures caused by race conditions between connection establishment and protocol stream opening.
Observed Symptoms
Proposed Changes
Evolution of Approach
First Attempt: Identify-based Coordination
Current Approach: Self-Contained Behaviour with Semantic Events
The handshake behaviour is now fully self-contained and auto-initiating:
Why This Works:
other_established == 0ensures only the FIRST connection initiates, preventing duplicatespoll()processes low-level RequestResponse events and emits semantic eventsEvidence
The concurrent dials test proves the fix:
Without
other_established == 0: Both peers initiate handshakes → duplicate requestsWith the check: First ConnectionEstablished (
other_established == 0) initiates, second (other_established == 1) skips → only ONE handshakeImplementation Details
Behaviour Changes:
poll()processes RequestResponse events internally and emits semantic Event::Completed/Event::FailedVecDeque<Event>) for buffering handshake resultson_swarm_event()Network Changes:
Test Improvements:
drive()for clean, declarative testingAdditional Info
Architecture
Follows idiomatic rust-libp2p patterns where behaviours are self-contained and composable. The behaviour:
poll()This is the same pattern used in Lighthouse and other production rust-libp2p implementations.
Compatibility
The asymmetric dialer-initiates pattern matches Go SSV's handshake flow, ensuring proper interoperability.