Skip to content

Comments

Fix deadlock in message validator#274

Merged
jking-aus merged 1 commit intosigp:unstablefrom
dknopik:fix-deadlock
Apr 17, 2025
Merged

Fix deadlock in message validator#274
jking-aus merged 1 commit intosigp:unstablefrom
dknopik:fix-deadlock

Conversation

@dknopik
Copy link
Member

@dknopik dknopik commented Apr 16, 2025

I noticed that sometimes, during a burst of incoming messages, Anchor would deadlock. More specifically, threads holding a processor permit would deadlock and the processor queue would become full shortly after.

The issue was doubly locking the network state. As RwLock (and thus tokio::sync::watch) avoids writer starvation by not allowing new readers while a writer is waiting, so trying to (read) borrow the lock while
a) holding a read lock already, and
b) a writer is waiting
is a deadlock, as we are now waiting for the writer, while the writer is waiting for us to release the already held read lock.

This PR fixes the issue by using the existing borrowed value instead of borrowing again. Furthermore, we explicitly drop the read guard to release it as early as possible.

@dknopik dknopik added bug Something isn't working ready-for-review This PR is ready to be reviewed labels Apr 16, 2025
@dknopik dknopik requested a review from diegomrsantos April 16, 2025 15:54
@diegomrsantos diegomrsantos requested a review from Copilot April 16, 2025 16:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

anchor/message_validator/src/lib.rs:253

  • [nitpick] Consider using a scoped block to limit the lifetime of 'network_state' rather than explicitly calling drop. This approach can help prevent accidental misuse of the lock in future modifications.
drop(network_state);

let operators_pks = self.get_operator_pks(signed_ssv_message.operator_ids())?;
let operators_pks =
get_operator_pks(&network_state, signed_ssv_message.operator_ids())?;
drop(network_state);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch, thanks. Is there a way we can avoid repeating this in the future? It still seems error-prone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we havelockbud in CI, which is supposed to catch stuff like this, but I guess it has its limits.

Otherwise we just have to stay extra vigilant during review, and adopt patterns like either dropping locks ASAP, or even better, acquiring them in a limited scope such as a function.

@jking-aus jking-aus merged commit 569be1d into sigp:unstable Apr 17, 2025
11 checks passed
@dknopik dknopik deleted the fix-deadlock branch June 20, 2025 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready-for-review This PR is ready to be reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants