-
-
Notifications
You must be signed in to change notification settings - Fork 104
fix: use RLock in ResourceScheduler to prevent deadlock #299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use RLock in ResourceScheduler to prevent deadlock #299
Conversation
allocate_resources() acquires self.lock and then calls allocate_cpu(), allocate_memory(), and allocate_gpu(), each of which also acquire self.lock. With a non-reentrant threading.Lock this causes a deadlock whenever build_knowledge_base() triggers the pipeline resource allocation path. Switch to threading.RLock() so the same thread can re-enter the lock. Co-authored-by: Cursor <cursoragent@cursor.com>
Review Summary by QodoFix ResourceScheduler deadlock by using RLock instead of Lock
WalkthroughsDescription• Replace non-reentrant Lock with RLock to prevent deadlock • Allows same thread to re-enter lock in nested calls • Fixes deadlock in build_knowledge_base() pipeline execution Diagramflowchart LR
A["allocate_resources<br/>acquires lock"] -- "calls" --> B["allocate_cpu<br/>tries to acquire lock"]
A -- "calls" --> C["allocate_memory<br/>tries to acquire lock"]
A -- "calls" --> D["allocate_gpu<br/>tries to acquire lock"]
E["Lock<br/>non-reentrant"] -.->|"causes deadlock"| F["Thread blocks itself"]
G["RLock<br/>reentrant"] -.->|"allows re-entry"| H["Thread proceeds"]
File Changes1. semantica/pipeline/resource_scheduler.py
|
Code Review by Qodo
1. Silent allocation failure
|
| self.resources: Dict[str, Resource] = {} | ||
| self.allocations: Dict[str, ResourceAllocation] = {} | ||
| self.lock = threading.Lock() | ||
| self.lock = threading.RLock() # RLock: allocate_resources holds lock and calls allocate_cpu/memory/gpu which also acquire it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Silent allocation failure 🐞 Bug ✓ Correctness
• With the new RLock, allocate_resources no longer self-deadlocks and will proceed, but it can return partial/empty allocations when capacity is insufficient. • ExecutionEngine does not validate allocations before running steps, so under load pipelines can execute without acquiring requested CPU/memory/GPU (oversubscription / broken scheduling semantics). • allocate_resources still reports tracking status as "completed" even when required resources weren’t allocated, reducing operator visibility into the failure mode.
Agent Prompt
## Issue description
`ResourceScheduler.allocate_resources()` can return an empty/partial allocations dict when a requested resource can’t be allocated (allocate_cpu/memory/gpu return `None`). `ExecutionEngine` proceeds to execute the pipeline anyway, so under contention the scheduler fails to enforce resource limits.
## Issue Context
This PR changes the scheduler lock to `threading.RLock()`, which makes `allocate_resources()` actually execute (previously it would self-deadlock due to nested locking). That makes the silent-allocation-failure path operational.
## Fix Focus Areas
- semantica/pipeline/resource_scheduler.py[180-240]
- semantica/pipeline/resource_scheduler.py[248-290]
- semantica/pipeline/resource_scheduler.py[343-386]
- semantica/pipeline/execution_engine.py[171-176]
## Suggested approach
1. In `allocate_resources`, treat `cpu_cores` / `memory_gb` / `gpu_device` options as required when provided (and defaults likely required too).
2. If any required allocation returns `None`, immediately `release_resources()` for any already-acquired allocations in this call and raise a `ProcessingError`/`ValidationError`.
3. In `ExecutionEngine.execute_pipeline`, handle allocation failure by returning a failed `ExecutionResult` (or implement retry/backoff/queueing if that’s the intended UX).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
- Change threading.Lock() to threading.RLock() in ResourceScheduler.__init__ - Fixes deadlock in allocate_resources() when it calls allocate_cpu/memory/gpu - Each allocate_* method also acquires the same lock, causing re-entrancy issue - RLock allows same thread to re-enter lock without blocking itself - Resolves build_knowledge_base() hanging indefinitely Additional improvements: - Add allocation validation to prevent silent failures - Move progress tracking updates outside lock for better performance - Add comprehensive regression tests - Add explanatory comment for RLock usage Resolves: #299
Coordinated by: - @d4ndr4d3 (original author) - @cursoragent (review coordinator) - @qodo-code-review (automated review) Follow-up: Add @qodo-code-reviewer and @cursoragent as PR reviewers
Coordinated by: - @d4ndr4d3 (original author) - @qodo-code-review (automated review) Follow-up: Add @qodo-code-reviewer and @KaifAhmad1 as PR reviewers
Test Fixes: - Fixed isinstance() TypeError in test suite - Updated validation tests to match actual behavior - Added comprehensive test coverage (6/6 tests passing) - Added regression test for complete resource allocation failure Additional Improvements: - Allocation validation with ValidationError when no resources allocated - Performance optimization: moved progress tracking outside lock - Documentation: added explanatory comment for RLock usage - Comprehensive test suite in tests/test_resource_scheduler_deadlock.py Qodo Review Concerns Addressed: ✅ Silent allocation failure - now raises ValidationError ✅ Lock held during progress updates - moved outside lock ✅ Deadlock prevention - RLock allows re-entrant acquisition All tests passing, ready for merge.
- Change threading.Lock() to threading.RLock() in ResourceScheduler.__init__ - Fixes deadlock in allocate_resources() when it calls allocate_cpu/memory/gpu - Each allocate_* method also acquires the same lock, causing re-entrancy issue - RLock allows same thread to re-enter lock without blocking itself - Resolves build_knowledge_base() hanging indefinitely Test fixes and improvements: - Add allocation validation to prevent silent failures - Move progress tracking updates outside lock for better performance - Add comprehensive regression tests - Add explanatory comment for RLock usage Addresses Qodo review concerns: ✅ Silent allocation failure - now raises ValidationError ✅ Lock held during progress updates - moved outside lock ✅ Deadlock prevention - RLock allows re-entrant acquisition Resolves: #299
KaifAhmad1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core Issue
ResourceScheduler deadlocks when allocate_resources() calls allocate_cpu/memory/gpu() due to nested Lock acquisition.
Fix Applied
self.lock = threading.RLock() # Line 112RLock allows same thread re-entrancy, preventing deadlock.
Testing Results
6/6 tests passing:
- Deadlock prevention
- Allocation validation
- Re-entrant lock behavior
- GPU allocation
- Edge case handling
Improvements Added
- Allocation Validation:
ValidationErrorwhen no resources allocated - Performance: Progress tracking moved outside lock
- Test Suite: Comprehensive regression tests
- Documentation: Clear RLock rationale
Qodo Review Concerns - All Resolved
| Concern | Status | Solution |
|---|---|---|
| Silent allocation failure | RESOLVED | ValidationError added |
| Lock held during progress updates | RESOLVED | Moved outside lock |
| Deadlock prevention | RESOLVED | RLock implemented |
Impact Assessment
- Fixes: Pipeline blocking deadlock affecting all users
- Maintains: Thread safety, backward compatibility, API unchanged
- Improves: Performance (reduced lock contention), error handling
Contributors
- @d4ndr4d3 - Original deadlock fix. Thanks for your contribution.
- @qodo-code-review - Automated review feedback
Decision: APPROVE
Critical bug fix with comprehensive testing, all review concerns addressed, and zero breaking changes. Ready for production deployment.
Summary
ResourceScheduler.allocate_resources()acquiresself.lock(line 205) and then callsallocate_cpu(),allocate_memory(), andallocate_gpu(), each of which also attempt to acquire the sameself.lock(lines 266, 305, 340).threading.Lock()(non-reentrant), this causes a deadlock wheneverbuild_knowledge_base()triggers the pipeline resource allocation path.threading.Lock()→threading.RLock()so the same thread can re-enter the lock without blocking itself.Reproduction
Change
One-line change in
semantica/pipeline/resource_scheduler.py:Test plan
build_knowledge_base()completes without deadlock after fix