discovery: fix gossiper shutdown deadlock#10540
Roasbeef merged 2 commits into lightningnetwork:master
Conversation
Summary of Changes: Hello @morehouse, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical deadlock issue within the gossiper component, specifically during its shutdown sequence. The fix is a targeted adjustment to an internal error channel's buffering capacity, preventing goroutines from blocking indefinitely when multiple errors are generated. This improves the robustness and reliability of the gossiper's shutdown process, ensuring cleaner termination even under specific error conditions.
Activity
Code Review
This pull request addresses a potential deadlock in the gossiper during shutdown. The fix is simple and effective: increasing the buffer size of an error channel to prevent blocking when multiple errors are sent without a reader. The change is well-commented, explaining the reason for the fix and noting it as a temporary solution with a TODO for a future redesign. A comprehensive regression test has been added that accurately reproduces the deadlock scenario, ensuring the fix is effective and preventing future regressions. The test is clear and well-documented. Overall, this is a solid contribution that improves the stability of the gossiper.
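For readers unfamiliar with how such a deadlock is usually caught in tests, the sketch below shows the general shape of a shutdown-deadlock regression test: run the shutdown in a goroutine and fail if it does not return within a deadline. This is a generic, hedged illustration, not the actual test added in this PR; `requireStopsInTime` is a hypothetical helper name.

```go
// Intended to live in a _test.go file; generic sketch only.
package gossipsketch

import (
	"testing"
	"time"
)

// requireStopsInTime is a hypothetical helper illustrating the usual
// pattern: run the component's Stop in a goroutine and fail the test if
// it does not return before the deadline, so a shutdown deadlock surfaces
// as a test failure instead of a hung test binary.
func requireStopsInTime(t *testing.T, stop func(), timeout time.Duration) {
	t.Helper()

	done := make(chan struct{})
	go func() {
		stop()
		close(done)
	}()

	select {
	case <-done:
		// Shutdown completed in time; no deadlock.
	case <-time.After(timeout):
		t.Fatal("shutdown deadlocked: Stop did not return before the deadline")
	}
}
```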
🟠 PR Severity: HIGH
🟠 High (1 file)
🟢 Tests (1 file)
Analysis: This PR fixes a shutdown deadlock in the gossiper component. Why HIGH severity:
Review considerations:
To override, add a `severity-override-{critical,high,medium,low}` label.
Thank you for the fix; please make sure the linter passes. Please also add an entry to the LND 21.0 release notes.
Force-pushed from 5d1d71c to 3b7b8b0 (Compare)
Linter fixed and release note added.
When processing a remote network announcement, it is possible for two error messages to be sent back on the errChan. Since Brontide doesn't actually read from errChan, and since errChan only buffered one error message, the sending goroutine would deadlock forever. This would only become apparent when the gossiper attempted to shut down and got hung up. For now, we can fix this simply by buffering up to two error messages on errChan. There is an existing TODO to restructure this logic entirely to use the actor model, and we can do a more thorough fix as part of that work. This bug was discovered while doing full node fuzz testing and was triggered by sending a specific channel_announcement message and then shutting down LND.
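To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern the commit message describes (plain Go, not lnd code): a sender pushes two errors into a channel that nobody reads, so with a capacity of one the second send blocks forever, while a capacity of two lets both sends complete.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sendErrors mimics the described situation: two error messages may be
// sent on errChan even though the caller never reads from it.
func sendErrors(errChan chan error, done chan struct{}) {
	errChan <- errors.New("first validation error")
	errChan <- errors.New("second validation error") // blocks if cap(errChan) < 2
	close(done)
}

func main() {
	for _, capacity := range []int{1, 2} {
		errChan := make(chan error, capacity)
		done := make(chan struct{})
		go sendErrors(errChan, done)

		select {
		case <-done:
			fmt.Printf("cap=%d: both sends completed\n", capacity)
		case <-time.After(200 * time.Millisecond):
			// With cap=1 the sender is stuck on the second send,
			// which is what hung the gossiper's shutdown.
			fmt.Printf("cap=%d: sender blocked (deadlock)\n", capacity)
		}
	}
}
```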
Force-pushed from 3b7b8b0 to 5fa025e (Compare)
Rebased.
//
// TODO(ziggie): Redesign this once the actor model pattern becomes
// available. See https://github.com/lightningnetwork/lnd/pull/9820.
errChan := make(chan error, 2)
Just wanted to share a few thoughts here. Even if the actor model pattern is not being used, it doesn’t feel intuitive to me that a single announcement message from a peer would result in two errors being sent on the error channel. Once we encounter the first error, it seems reasonable to return immediately.
If the current focus is just on fixing the issue rather than revisiting the flow, that’s totally fine. I mainly wanted to understand the reasoning behind emitting a second error here (especially since a blocked slot in the gossiper seems to have a much larger impact than a deadlock during shutdown).
This PR should only focus on solving the problem that was found during fuzzing. However, I think we can revisit this area in the future to make sure the design follows proper design principles.
Generally I consider the pattern of sending errors over a channel in a function like this an anti-pattern. It's caused many bugs (like this) in the past, and can also make the control flow a bit harder to track as well.
> Once we encounter the first error, it seems reasonable to return immediately.
Agreed. fn.Future can work nicely here for that, as it will just return that value instead of messing around with channels here.
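As a rough illustration of the suggestion above, the sketch below shows a future-style shape in plain Go: the worker completes a single result exactly once and the caller awaits it, so there is no buffered error channel to size correctly. The `future` type here is a hypothetical stand-in and does not reflect the actual API of lnd's `fn` package.

```go
package main

import (
	"errors"
	"fmt"
)

// future holds a single result that is produced exactly once and can be
// awaited by the caller, sidestepping buffered error channels entirely.
// Hypothetical illustration only, not lnd's fn.Future.
type future[T any] struct {
	ch chan T
}

func newFuture[T any]() *future[T] {
	return &future[T]{ch: make(chan T, 1)}
}

func (f *future[T]) complete(v T) { f.ch <- v }
func (f *future[T]) await() T     { return <-f.ch }

// processAnnouncement returns on the first error it hits, so at most one
// value is ever delivered to the future.
func processAnnouncement(valid bool) *future[error] {
	f := newFuture[error]()
	go func() {
		if !valid {
			f.complete(errors.New("announcement failed validation"))
			return
		}
		f.complete(nil)
	}()
	return f
}

func main() {
	fmt.Println("result:", processAnnouncement(false).await())
}
```

Returning on the first error, as suggested above, means at most one value is ever produced, which is exactly the contract a single-slot future encodes.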
Created backport PR for v0.20.x-branch.
Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin backport-10540-to-v0.20.x-branch
git worktree add --checkout .worktree/backport-10540-to-v0.20.x-branch backport-10540-to-v0.20.x-branch
cd .worktree/backport-10540-to-v0.20.x-branch
git reset --hard HEAD^
git cherry-pick -x 5fa025e9e7a180de0a81750757fcc20eb37b28a9
git push --force-with-lease
[v0.20.x-branch] Backport #10540: discovery: fix gossiper shutdown deadlock