Fix pool reference deadlock during PAUSE/RELOAD/RESUME#941
Open
bhtek wants to merge 1 commit intopostgresml:mainfrom
Open
Fix pool reference deadlock during PAUSE/RELOAD/RESUME#941bhtek wants to merge 1 commit intopostgresml:mainfrom
bhtek wants to merge 1 commit intopostgresml:mainfrom
Conversation
This commit fixes a critical deadlock issue where client connections would hang indefinitely during database switchover operations involving PAUSE, RELOAD, and RESUME commands in transaction pooling mode. ## Problem When performing a hot database switchover (PAUSE → RELOAD → RESUME), client applications would become permanently stuck when trying to establish new connections, even after RESUME completed. This made PgCat unsuitable for zero-downtime database migration scenarios. The issue had two root causes: 1. **Pool Reference Deadlock**: When RELOAD creates a new pool object, clients waiting on the old pool's `paused_waiter` were never woken up when RESUME was called on the new pool. This caused permanent deadlock because: - Client holds reference to OLD pool (pre-RELOAD) - Client blocks on `OLD_pool.wait_paused().await` - RELOAD creates NEW pool with different `paused_waiter` - RESUME calls `NEW_pool.resume()` - Client still waiting on OLD pool → deadlock! 2. **Unvalidated Pools After RELOAD**: New pools created during PAUSE were not validated before use, potentially causing authentication to block if validation was triggered during client connection. ## Solution This fix implements a two-part solution: ### Part 1: Make resume() async and validate pools (pool.rs) - Changed `resume()` from sync to async function - Added pool validation before resuming if pool is unvalidated - Ensures pools are ready for use before accepting client connections - Updated all call sites in admin.rs to await the async function ### Part 2: Resume old pools before RELOAD (config.rs) - Before creating new pools during RELOAD, explicitly resume all paused old pools to wake up waiting clients - This allows clients to: 1. Wake up from old pool's `wait_paused()` 2. Reach the `pool = self.get_pool()` refresh line 3. Obtain reference to NEW pool 4. Continue normal operation with new pool ## Testing **Unit Tests**: All 38 unit tests + 4 doc tests pass - ✓ `cargo test` - all tests passing - ✓ `cargo fmt` - code properly formatted - ✓ `cargo clippy` - no warnings **Integration Tests**: Real-world database switchover scenario - ✓ PAUSE during active write workload (transaction pooling) - ✓ RELOAD with backend database change (postgres1 → postgres2) - ✓ RESUME with immediate write recovery - ✓ No connection hangs - ✓ No data loss or sequence gaps - ✓ 47 successful writes within 5 seconds post-RESUME Test results from production-like switchover scenario: ``` Recent writes (last 5s): postgres1: 0 postgres2: 47 ✓ postgres2 is receiving writes, postgres1 is not - SWITCHOVER SUCCESSFUL! ✓ Writer recovered successfully (iteration 200 → 300) ✓ No gaps in iteration sequence ``` ## Files Changed - `src/pool.rs`: Made `resume()` async, added validation logic - `src/admin.rs`: Updated `resume()` call sites to await - `src/config.rs`: Added old pool resume before RELOAD, imported `get_all_pools` - `tests/ruby/pause_new_connections_spec.rb`: Added comprehensive test ## Impact This fix enables: - Zero-downtime database migrations using PgCat - Safe hot switchover between primary/replica or different databases - Reliable PAUSE/RELOAD/RESUME workflow in production environments - Compatibility with transaction pooling mode during switchovers ## Related Issues This addresses the issue described in FIX.md regarding broken PAUSE/RESUME support where new client connections would hang indefinitely during PAUSE operations. 🤖 Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit fixes a critical deadlock issue where client connections would hang indefinitely during database switchover operations involving PAUSE, RELOAD, and RESUME commands in transaction pooling mode.
Problem
When performing a hot database switchover (PAUSE → RELOAD → RESUME), client applications would become permanently stuck when trying to establish new connections, even after RESUME completed. This made PgCat unsuitable for zero-downtime database migration scenarios.
The issue had two root causes:
Pool Reference Deadlock: When RELOAD creates a new pool object, clients waiting on the old pool's
paused_waiterwere never woken up when RESUME was called on the new pool. This caused permanent deadlock because:OLD_pool.wait_paused().awaitpaused_waiterNEW_pool.resume()Unvalidated Pools After RELOAD: New pools created during PAUSE were not validated before use, potentially causing authentication to block if validation was triggered during client connection.
Solution
This fix implements a two-part solution:
Part 1: Make resume() async and validate pools (pool.rs)
resume()from sync to async functionPart 2: Resume old pools before RELOAD (config.rs)
wait_paused()pool = self.get_pool()refresh lineTesting
Unit Tests: All 38 unit tests + 4 doc tests pass
cargo test- all tests passingcargo fmt- code properly formattedcargo clippy- no warningsIntegration Tests: Real-world database switchover scenario
Test results from production-like switchover scenario:
Files Changed
src/pool.rs: Maderesume()async, added validation logicsrc/admin.rs: Updatedresume()call sites to awaitsrc/config.rs: Added old pool resume before RELOAD, importedget_all_poolstests/ruby/pause_new_connections_spec.rb: Added comprehensive testImpact
This fix enables:
Related Issues
This addresses the issue described in FIX.md regarding broken PAUSE/RESUME support where new client connections would hang indefinitely during PAUSE operations.
🤖 Co-Authored-By: Claude noreply@anthropic.com