2 changes: 1 addition & 1 deletion semantica/pipeline/resource_scheduler.py
@@ -109,7 +109,7 @@ def __init__(self, config: Optional[Dict[str, Any]] = None, **kwargs):

     self.resources: Dict[str, Resource] = {}
     self.allocations: Dict[str, ResourceAllocation] = {}
-    self.lock = threading.Lock()
+    self.lock = threading.RLock()  # RLock: allocate_resources holds the lock and calls allocate_cpu/memory/gpu, which also acquire it

Action required

1. Silent allocation failure 🐞 Bug ✓ Correctness

• With the new RLock, allocate_resources no longer self-deadlocks and will proceed, but it can
  return a partial or empty allocations dict when capacity is insufficient.
• ExecutionEngine does not validate allocations before running steps, so under load pipelines can
  execute without having acquired the requested CPU/memory/GPU (oversubscription; broken scheduling semantics).
• allocate_resources still reports tracking status as "completed" even when required resources
  weren't allocated, reducing operator visibility into the failure mode.
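The failure mode the bullets describe can be reduced to a few lines. This is a simplified sketch (assumed shape, not the project's actual code): each allocator returns `None` when capacity is exceeded, the caller silently drops the `None`, and the tracking status is set to "completed" regardless.

```python
def allocate_cpu(cores, capacity=8):
    # Returns an allocation on success, or None when over capacity
    return {"cpu": cores} if cores <= capacity else None

def allocate_resources(cpu_cores):
    allocations = {}
    result = allocate_cpu(cpu_cores)
    if result:                 # a None result is silently skipped
        allocations.update(result)
    status = "completed"       # reported even when allocations is empty
    return allocations, status

print(allocate_resources(99))  # -> ({}, 'completed'): empty, yet "completed"
```

The caller has no way to distinguish "got everything" from "got nothing" without inspecting the dict itself.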
Agent Prompt
## Issue description
`ResourceScheduler.allocate_resources()` can return an empty/partial allocations dict when a requested resource can’t be allocated (allocate_cpu/memory/gpu return `None`). `ExecutionEngine` proceeds to execute the pipeline anyway, so under contention the scheduler fails to enforce resource limits.

## Issue Context
This PR changes the scheduler lock to `threading.RLock()`, which makes `allocate_resources()` actually execute (previously it would self-deadlock due to nested locking). That makes the silent-allocation-failure path operational.

## Fix Focus Areas
- semantica/pipeline/resource_scheduler.py[180-240]
- semantica/pipeline/resource_scheduler.py[248-290]
- semantica/pipeline/resource_scheduler.py[343-386]
- semantica/pipeline/execution_engine.py[171-176]

## Suggested approach
1. In `allocate_resources`, treat `cpu_cores` / `memory_gb` / `gpu_device` options as required when provided (and defaults likely required too).
2. If any required allocation returns `None`, immediately `release_resources()` for any already-acquired allocations in this call and raise a `ProcessingError`/`ValidationError`.
3. In `ExecutionEngine.execute_pipeline`, handle allocation failure by returning a failed `ExecutionResult` (or implement retry/backoff/queueing if that’s the intended UX).
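Steps 1–2 above can be sketched as follows. Names such as `ProcessingError`, `_allocate_one`, and `release_resources` mirror the description but are assumptions, not the project's real signatures; capacities are hard-coded for illustration.

```python
import threading

class ProcessingError(Exception):
    """Raised when a required resource cannot be allocated."""

class ResourceScheduler:
    def __init__(self):
        self.lock = threading.RLock()
        self.allocations = {}

    def _allocate_one(self, kind, amount, capacity):
        # Returns an allocation id, or None when capacity is insufficient
        return f"{kind}-alloc" if amount <= capacity else None

    def release_resources(self, alloc_ids):
        for alloc_id in alloc_ids:
            self.allocations.pop(alloc_id, None)

    def allocate_resources(self, step_id, cpu_cores=1, memory_gb=1):
        acquired = []
        with self.lock:
            for kind, amount, capacity in [("cpu", cpu_cores, 8), ("mem", memory_gb, 16)]:
                alloc = self._allocate_one(kind, amount, capacity)
                if alloc is None:
                    # Roll back anything acquired in this call, then fail loudly
                    self.release_resources(acquired)
                    raise ProcessingError(f"cannot allocate {kind} for {step_id}")
                self.allocations[alloc] = step_id
                acquired.append(alloc)
        return acquired
```

With this shape, `ExecutionEngine.execute_pipeline` can catch `ProcessingError` and return a failed `ExecutionResult` instead of running the step unresourced.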

self._initialize_resources()
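The nested-acquire pattern the diff comment refers to can be shown in a minimal sketch (simplified names, not the project's actual code): the outer method holds the lock and calls a helper that acquires it again. With `threading.Lock()` the second acquire blocks the same thread forever; `threading.RLock()` lets the holding thread re-enter.

```python
import threading

lock = threading.RLock()  # swapping in threading.Lock() reproduces the self-deadlock

def allocate_cpu():
    with lock:            # nested (re-entrant) acquire by the same thread
        return "cpu-0"

def allocate_resources():
    with lock:            # outer acquire
        return allocate_cpu()

print(allocate_resources())  # -> cpu-0 (would hang with a plain Lock)
```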
