Skip to content

Conversation

@alltheseas
Copy link
Collaborator

@alltheseas alltheseas commented Feb 2, 2026

Summary

Fix data race on isResumed flag in NostrNetworkManager.awaitConnection().

The previous implementation had two issues:

  1. isResumed local variable was accessed without synchronization between timeout task and resumeConnectionContinuation()
  2. Race between checking isConnected and registering the continuation could cause unnecessary timeout waits

Changes:

  • Eliminate isResumed flag - use continuation dictionary as source of truth
  • Check isConnected atomically inside the lock when storing continuation
  • Use removeValue(forKey:) to atomically remove and return continuation

Checklist

Standard PR Checklist

  • I have read (or I am familiar with) the Contribution Guidelines
  • I have tested the changes in this PR
  • I have profiled the changes to ensure there are no performance regressions, or I do not need to profile the changes.
    • Not needed: Fix addresses threading correctness, not performance
  • I have opened or referred to an existing github issue related to this change.
  • My PR is either small, or I have split it into smaller logical commits that are easier to review
  • I have added the signoff line to all my commits.
  • I have added appropriate changelog entries for the changes in this PR.
  • I have added appropriate Closes: or Fixes: tags in the commit messages wherever applicable.

Test report

Device: iPhone 17 Pro Simulator

iOS: 26.2

Damus: fix/network-layer-races branch

Setup: Thread Sanitizer enabled for race detection

Steps:

  1. Run xcodebuild test -scheme damus -destination 'platform=iOS Simulator,name=iPhone 17 Pro' -only-testing:damusTests/NostrNetworkManagerConcurrencyTests -only-testing:damusTests/NostrNetworkManagerIntegrationTests -enableThreadSanitizer YES

Results:

  • PASS

Unit Tests (4 tests)

Test Result
testAwaitConnection_ConcurrentCalls_NoDoubleResume ✅ passed
testAwaitConnection_TimeoutRacesWithConnect_NoDoubleResume ✅ passed
testAwaitConnection_StressTest_ManyIterations ✅ passed
testAwaitConnection_AlreadyConnected_ImmediateReturn ✅ passed

Integration Tests (8 tests)

Test Result
testConcurrentAwaitConnectionCalls ✅ 0.243s
testRapidConnectDisconnectCycles ✅ 0.511s
testConcurrentSubscriptions ✅ 0.330s
testConcurrentSubscriptions_StressTest ✅ 0.351s
testEventFlowIntegrity ✅ 0.452s
testSubscriptionCancellation ✅ 1.518s
testBackgroundForegroundTransitions ✅ 0.800s
testConcurrentOpsWithLifecycleTransitions ✅ 0.359s

Total: 12 tests, 0 failures, ~4.6 seconds

Other notes

Tests designed for Thread Sanitizer detection of data races.

Closes #3572

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added comprehensive test coverage for network connection handling, including concurrency and integration scenarios to improve reliability and stability.

@coderabbitai
Copy link

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

Walkthrough

This pull request refactors NostrNetworkManager's connection continuation handling to eliminate a data race on the isResumed flag. The continuations dictionary is now the sole source of truth for managing pending continuations, unifying the timeout and connect code paths to call a single-argument resumeConnectionContinuation method, preventing double-resume crashes. Comprehensive concurrency and integration tests are added.

Changes

Cohort / File(s) Summary
NostrNetworkManager Continuation Refactoring
damus/Core/Networking/NostrNetworkManager/NostrNetworkManager.swift
Removed isResumed inout parameter from resumeConnectionContinuation. Dictionary now serves as single source of truth for pending continuations. Immediate resume for already-connected state now occurs within the lock without storing continuation. Both timeout and connect paths unified to rely on dictionary state to avoid double-resume.
Project Configuration
damus.xcodeproj/project.pbxproj
Added two new test files to the test target: NostrNetworkManagerConcurrencyTests.swift and NostrNetworkManagerIntegrationTests.swift with corresponding build file and file reference entries.
Concurrency Test Suite
damusTests/NostrNetworkManagerConcurrencyTests.swift
New test file with ConnectionContinuationTestHarness simulating thread-safe continuation dictionary management. Tests validate concurrent awaitConnection calls, timeout vs. connect races, stress conditions with multiple iterations, and immediate return when already connected—all without double-resume.
Integration Test Suite
damusTests/NostrNetworkManagerIntegrationTests.swift
New comprehensive integration test file covering connection lifecycle, concurrent subscriptions, event streaming, RelayPool and SubscriptionManager interactions, cancellation semantics, background/foreground transitions, race conditions, and timeout/failure scenarios with cleanup validation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant awaitConnection
    participant Dictionary as Continuations<br/>Dictionary
    participant Timeout as Timeout Task
    participant Connect as connect()

    Client->>awaitConnection: awaitConnection(timeout)
    
    alt Already Connected
        awaitConnection->>Client: Resume immediately
    else Not Connected
        awaitConnection->>Dictionary: Store continuation<br/>(under lock)
        Dictionary->>awaitConnection: Done
        
        par Timeout Path
            Timeout->>Timeout: Sleep(timeout duration)
            Timeout->>Dictionary: resumeConnectionContinuation<br/>(requestId)
            alt Continuation Exists
                Dictionary->>Timeout: Remove & resume
            else Already Resumed
                Dictionary->>Timeout: Not found, return
            end
        and Connect Path
            Connect->>Connect: Mark connected
            Connect->>Dictionary: Resume all pending<br/>continuations
            Dictionary->>Client: Resume all
        end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #3553: Introduced the inout isResumed parameter that this PR refactors away, as part of the original continuation-based connection handling architecture.

Suggested labels

pr-active-review

Poem

🐰 A race upon the continuation track,
Where doubles crept and threads attacked!
Now dictionary holds the truth so dear,
No resumption twice—the path is clear! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly and concisely describes the main fix: eliminating the data race on the isResumed flag in NostrNetworkManager.
Description check ✅ Passed Description comprehensively covers the problem, solution, testing methodology, and follows the Standard PR Checklist template with all key sections completed.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from issue #3572: eliminates unsynchronized isResumed access, makes continuation removal atomic, removes TOCTOU race, and includes comprehensive concurrent tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the data race: modifications to NostrNetworkManager logic, project file updates for new tests, and two focused test files exercising concurrency and integration scenarios.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
damus/Core/Networking/NostrNetworkManager/NostrNetworkManager.swift (1)

95-121: ⚠️ Potential issue | 🟠 Major

Prevent a race between connection check and continuation registration.

If connect() flips isConnected to true after Line 96 but before Line 109, this call can wait until the timeout even though the connection is already established. Re-check (and/or insert) under the same lock, or resume immediately if already connected.

🛠️ Proposed fix (single locked check + immediate resume)
-        // Short-circuit if already connected
-        continuationsLock.lock()
-        let alreadyConnected = isConnected
-        continuationsLock.unlock()
-
-        guard !alreadyConnected else {
-            return
-        }
-
         let requestId = UUID()
         var timeoutTask: Task<Void, Never>?
 
         await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
             // Store the continuation in a thread-safe manner
             continuationsLock.lock()
-            connectionContinuations[requestId] = continuation
-            continuationsLock.unlock()
+            if isConnected {
+                continuationsLock.unlock()
+                continuation.resume()
+                return
+            }
+            connectionContinuations[requestId] = continuation
+            continuationsLock.unlock()

@alltheseas alltheseas force-pushed the fix/network-layer-races branch from 42d01d9 to 6d3a2a2 Compare February 3, 2026 00:06
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@damusTests/NostrNetworkManagerConcurrencyTests.swift`:
- Around line 156-205: Add brief API docstrings for each helper method to
satisfy doc coverage: document awaitConnection(timeout:) to explain it
asynchronously waits for connection with optional timeout and how the
continuation is registered/cleared; document
resumeConnectionContinuation(requestId:) to state it resumes and removes the
stored continuation for the given UUID; document connect() to state it marks the
manager connected and resumes all pending continuations; and document reset() to
state it clears connection state and pending continuations. Place these comments
immediately above the corresponding function declarations (awaitConnection,
resumeConnectionContinuation, connect, reset).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@damusTests/NostrNetworkManagerIntegrationTests.swift`:
- Around line 27-36: Add brief docstrings above the setUpWithError() and
tearDownWithError() methods describing their purpose: setUpWithError should note
it initializes damusState for each test (mention generate_test_damus_state,
mock_profile_info and addNdbToRelayPool settings) and tearDownWithError should
note it cleans up by setting damusState to nil. Keep each comment one or two
sentences and follow the file's existing docstring style and formatting.
- Around line 73-89: The test testRapidConnectDisconnectCycles currently only
calls damusState.nostrNetwork.connect(), so add a matching disconnect call to
exercise the disconnect path: inside the loop after the brief Task.sleep(for:
.milliseconds(50)) call, invoke await damusState.nostrNetwork.disconnect() (and
optionally a short sleep after disconnect) so each cycle performs connect then
disconnect; update the loop in the test method to call disconnect on the
nostrNetwork instance.
- Around line 93-136: The testConcurrentSubscriptions pre-population quietly
no-ops if test_notes.jsonl is missing and performs blocking I/O on the
`@MainActor`; change it to fail fast if the fixture is missing (XCTFail and return
when fileURL is nil) and move the file read off the main thread by awaiting a
background task (e.g., let notesJSONL = try await Task.detached { try
String(contentsOf: fileURL, encoding: .utf8) }.value), then iterate the lines
and call damusState.ndb.processEvent(...) on the appropriate actor; keep the
rest of the test (including the stream loop on
damusState.nostrNetwork.reader.advancedStream) unchanged.
🧹 Nitpick comments (1)
damusTests/NostrNetworkManagerIntegrationTests.swift (1)

138-191: Move bulk pre-population off @MainActor to keep the main thread unblocked.

These tests do large synchronous ndb.processEvent loops on the MainActor. If Ndb is thread-safe, consider moving the pre-population work to a background task/queue and awaiting completion.

♻️ Example refactor (apply similarly where safe)
-        // Pre-populate
-        for i in 0..<20 {
-            guard let testNote = NostrEvent(
-                content: "Stress test note \(i)",
-                keypair: test_keypair,
-                kind: NostrKind.text.rawValue,
-                tags: []
-            ) else { continue }
-
-            let eventJson = encode_json(testNote)!
-            _ = damusState.ndb.processEvent("[\"EVENT\",\"subid\",\(eventJson)]")
-        }
+        // Pre-populate off MainActor (if Ndb is thread-safe).
+        let ndb = damusState.ndb
+        await Task.detached {
+            for i in 0..<20 {
+                guard let testNote = NostrEvent(
+                    content: "Stress test note \(i)",
+                    keypair: test_keypair,
+                    kind: NostrKind.text.rawValue,
+                    tags: []
+                ) else { continue }
+
+                let eventJson = encode_json(testNote)!
+                _ = ndb.processEvent("[\"EVENT\",\"subid\",\(eventJson)]")
+            }
+        }.value

As per coding guidelines: Never block the main thread: All network requests, database queries, and expensive computations must run on background threads/queues using Task { }, DispatchQueue.global(), or Swift concurrency (async/await).

Also applies to: 195-248, 252-302, 330-380

alltheseas added a commit to alltheseas/damus that referenced this pull request Feb 3, 2026
…fixes"

This reverts commit 0098c4a as these tests belong in their respective PR branches:
- NostrNetworkManagerIntegrationTests.swift → PR2 damus-io#3586
- WalletModelIntegrationTests.swift → PR3 damus-io#3587
- FilterStateConcurrencyTests.swift → PR4 damus-io#3588

Signed-off-by: alltheseas <alltheseas@users.noreply.github.com>
@alltheseas alltheseas force-pushed the fix/network-layer-races branch from 3d09763 to e014269 Compare February 3, 2026 04:39
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@damusTests/NostrNetworkManagerIntegrationTests.swift`:
- Around line 101-151: The test currently calls damusState.ndb.processEvent(...)
on the main actor, which performs synchronous work through withNdb() →
keepNdbOpen() → ndb_process_event and can block the main thread; move the entire
pre-population loop that iterates notesJSONL.split(...) into a background task
(e.g. Task.detached { ... } or DispatchQueue.global().async) so processEvent
runs off the main actor, await its completion before continuing the test, and
apply the same change to the other test methods that pre-populate the DB.

@alltheseas
Copy link
Collaborator Author

Agent Requirements Compliance Review

Requirement Status Notes
Simplicity Removed isResumed flag, uses dictionary as source of truth
Logically distinct commits Single focused commit for the fix
Standalone commits Each commit builds and runs
Human readable Clear variable names, documented logic
Human reviewable Small diff, focused change
Docstring coverage All test methods and harness documented
PR template followed Summary, Test Plan, Checklist present
Nevernesting Uses early returns in continuation handling
nostrdb consideration N/A Network layer, not storage
Threading Fix ensures no main thread blocking
Prefer actors over locks ⚠️ Uses NSLock (matches existing NostrNetworkManager pattern)
Cherry-pick preservation N/A Original work
Rebased fixes Review feedback incorporated via rebase
No duplicate code Reuses existing lock patterns
Test coverage 12 tests: unit, integration, stress
No CI fudging All tests pass legitimately

Note on actors vs locks: NSLock used for consistency with existing NostrNetworkManager synchronization. Refactoring to actors would be a larger change better suited for a dedicated PR.

@alltheseas alltheseas force-pushed the fix/network-layer-races branch from 0d230f6 to b92c769 Compare February 3, 2026 06:08
alltheseas and others added 2 commits February 3, 2026 00:17
The previous implementation had two issues:
1. isResumed local variable was accessed without synchronization between
   timeout task and resumeConnectionContinuation()
2. Race between checking isConnected and registering the continuation could
   cause unnecessary timeout waits

Changes:
- Eliminate isResumed flag - use continuation dictionary as source of truth
- Check isConnected atomically inside the lock when storing continuation
- Use removeValue(forKey:) to atomically remove and return continuation

Closes damus-io#3572
Changelog-Fixed: Fix data race in NostrNetworkManager connection handling

Signed-off-by: alltheseas <alltheseas@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive test coverage for the isResumed race fix:

Concurrency Tests (NostrNetworkManagerConcurrencyTests.swift - 4 tests):
- Concurrent awaitConnection calls without double-resume
- Timeout races with connect without double-resume
- Stress test with many iterations
- Already-connected immediate return

Integration Tests (NostrNetworkManagerIntegrationTests.swift - 8 tests):
- Concurrent awaitConnection calls
- Rapid connect/disconnect cycles
- Concurrent subscriptions
- Concurrent subscriptions stress test
- Event flow integrity
- Subscription cancellation
- Background/foreground transitions
- Concurrent ops with lifecycle transitions

Run with Thread Sanitizer for best results:
  xcodebuild test -enableThreadSanitizer YES

Signed-off-by: alltheseas <alltheseas@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: NostrNetworkManager.awaitConnection() data race on isResumed flag

1 participant