Enhance autoscaling during task rollover #18954
Fly-Style wants to merge 8 commits into apache:master
Conversation
    // Set the pending rollover flag - actual change applied in Phase 2
    // when ALL actively reading task groups have stopped
    pendingRolloverTaskCount = rolloverTaskCount;
Is it possible that a scenario happens where `rolloverTaskCount` is assigned x but the number of task groups we end up stopping is y, with x > y (which seems possible due to the max allowed stops config)?
If this happens, during Phase 2 we will end up not applying the change at all and be stuck in a loop.
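To make the cycle accounting concrete, here is a small simulation of how many supervisor runs it takes before the Phase 2 gate can fire when stops are capped per cycle. This is purely illustrative; the method and class names are invented for this sketch and only mirror the names used in the PR discussion, not Druid's actual code.

```java
// Illustrative sketch, not Druid code: Phase 2 fires only once the number
// of actively reading task groups reaches zero, so a per-cycle cap on
// stops stretches the rollover across multiple supervisor cycles.
public class RolloverDrainSketch {
    // Returns the number of cycles needed before Phase 2 can apply the
    // pending taskCount, stopping at most maxAllowedStops groups per cycle.
    static int cyclesUntilPhase2(int activeGroups, int maxAllowedStops) {
        int cycles = 0;
        while (activeGroups > 0) {
            activeGroups -= Math.min(activeGroups, maxAllowedStops);
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        // 8 active groups, at most 3 stops per cycle -> drains in 3 cycles.
        System.out.println(cyclesUntilPhase2(8, 3));
    }
}
```

If stops never succeed (y stays below x indefinitely), the loop above would never terminate, which is exactly the stuck-in-a-loop concern raised here.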
jtuglu1 left a comment:
Thanks! Left some initial comments
    // Set the pending rollover flag - actual change applied in Phase 2
    // when ALL actively reading task groups have stopped
    pendingRolloverTaskCount = rolloverTaskCount;
Another question here is whether we want to invalidate these results if things have drastically changed during a roll-over. For supervisors running 100s of tasks (think 300-400 tasks), a roll-over period might take 10-15 mins depending on how aggressive you are (it also depends on how many other concurrent supervisors you have running). Naturally, lag will spike much higher on roll-over, so perhaps it's worth considering some sort of sanity check before committing this result?
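One way such a sanity check could look is sketched below. This is hypothetical: `shouldCommit`, the tolerance parameter, and the surrounding class are invented for illustration and are not part of this PR; the idea is simply to re-estimate the desired task count after the long rollover and discard the stale pending result if conditions drifted too far.

```java
// Hypothetical sketch of a pre-commit sanity check: re-evaluate the
// desired task count after the (possibly 10-15 minute) rollover and only
// commit the pending result if it is still close to a fresh estimate.
// All names here are invented for illustration.
public class RolloverSanityCheckSketch {
    // tolerance is an assumed knob: how far the stale pending count may
    // deviate from the fresh estimate and still be committed.
    static boolean shouldCommit(int pendingTaskCount, int freshEstimate, int tolerance) {
        return Math.abs(pendingTaskCount - freshEstimate) <= tolerance;
    }

    public static void main(String[] args) {
        // Rollover staged 4 tasks a while ago; lag spiked and a fresh
        // estimate now says 9 -> discard and let the autoscaler re-run.
        System.out.println(shouldCommit(4, 9, 1));
    }
}
```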
Actually, I thought about this and made scale-down during rollover a configurable option here: #18958.
Definitely, there are cases where we may not want to use this option and instead scale down during task runtime, as usual.
    }
    log.info(
        "Stopping taskGroup[%d] for autoscaler rollover to [%d] tasks.",
        groupId, rolloverTaskCount
    );
Maybe log both `stoppedForRollover` and `availableStops`?
    maybeApplyPendingScaleRollover();

    checkPendingCompletionTasks();
Should this be called after checkPendingCompletionTasks()?
maybeApplyPendingScaleRollover is gated only on activelyReadingTaskGroups being empty, and it’s safe even if there are still publishing tasks. Applying the taskCount change earlier (before you prune/kill pending completion groups) ensures that the new allocation is staged as soon as it’s safe and avoids extra cycles where you keep old taskCount despite no active readers.
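The gating described here can be sketched as follows. This is an illustrative approximation, not the actual `SeekableStreamSupervisor` implementation; the field names mirror the discussion, and the collection types are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative approximation of the Phase 2 gate: the pending taskCount
// is applied as soon as no task group is actively reading, even if some
// groups are still publishing. Not the actual Druid implementation.
public class PendingRolloverGateSketch {
    final Set<Integer> activelyReadingTaskGroups = new HashSet<>();
    final Set<Integer> pendingCompletionTaskGroups = new HashSet<>();
    Integer pendingRolloverTaskCount; // staged in Phase 1 (checkTaskDuration)
    int taskCount = 8;

    void maybeApplyPendingScaleRollover() {
        // Publishing tasks (pendingCompletionTaskGroups) do not block the
        // change, so this is safe to run before checkPendingCompletionTasks().
        if (pendingRolloverTaskCount != null && activelyReadingTaskGroups.isEmpty()) {
            taskCount = pendingRolloverTaskCount;
            pendingRolloverTaskCount = null;
        }
    }
}
```

Under this sketch, calling the method while any group is still actively reading is a no-op, which is why staging it early costs nothing but saves a cycle once the last reader stops.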
Let's update the maybeApplyPendingScaleRollover() function comments then?
* This method is called after {@link #checkPendingCompletionTasks()} to check if a pending
* scale rollover can be applied. The scale is only applied when:
* <ul>
* <li>A pending rollover was set up in {@link #checkTaskDuration()} (Phase 1)</li>
* <li>All actively reading task groups have stopped (moved to pendingCompletionTaskGroups)</li>
* </ul>
* <p>
* By deferring the taskCount change until all old tasks have stopped, we avoid
* partition allocation mismatches that would cause {@link #discoverTasks()} to kill
 * publishing tasks on the next cycle.
That's a good catch! Thanks in advance!
This patch improves autoscaling during task rollover. Previously, despite the algorithmic correctness, the emptiness of `activelyReadingTaskGroups` was a very rare event in reality (though it accidentally happened to me during manual testing of this feature), and task scale-down therefore happened very rarely as well, which is inappropriate given the facts that:

UPD: to not block the cost-based autoscaler from being used in production, there is a parallel patch allowing scale-down during task runtime: #18958
Changeset
- Added `maxAllowedStops` to avoid worker exhaustion and a disruptive ingestion stop.

Key changed class in this PR: `SeekableStreamSupervisor`

Algorithm:
- Phase 1 (`checkTaskDuration()`): compute the new `taskCount`; if needed, set `pendingRolloverTaskCount` and stop tasks (we respect `maxAllowedStops`)
- Phase 2 (`maybeApplyPendingScaleRollover()`): apply the pending change

Full flow is available on the diagram under the spoiler.
Details
```mermaid
sequenceDiagram
    participant S as Supervisor
    participant ATG as activelyReadingTaskGroups
    participant PCG as pendingCompletionTaskGroups
    participant AS as AutoScaler
    Note over S: Cycle 1 - Scale triggered
    S->>ATG: 8 task groups running
    S->>AS: computeTaskCountForRollover()
    AS-->>S: returns 4
    S->>S: pendingRolloverTaskCount = 4
    S->>ATG: Stop 3 groups (maxAllowedStops)
    S->>PCG: Move 3 groups to publishing
    Note over ATG: 5 groups remain
    Note over S: Cycle 2 - Continue stopping
    S->>PCG: 1 publishing complete
    S->>ATG: Stop 3 more groups
    S->>PCG: Move to publishing
    Note over ATG: 2 groups remain
    Note over S: Cycle 3 - Phase 2 triggers
    S->>PCG: 3 publishing complete
    S->>ATG: Stop remaining 2 groups
    S->>PCG: Move to publishing
    Note over ATG: EMPTY - Phase 2 triggers!
    S->>S: changeTaskCountInIOConfig(4)
    S->>S: clearAllocationInfo()
    S->>S: pendingRolloverTaskCount = null
    Note over S: Cycle 4 - New tasks created
    S->>S: updatePartitionDataFromStream()
    Note over S: Rebuild partitionGroups with taskCount=4
    S->>ATG: Create 4 new task groups
    Note over S: All 4 tasks start (capacity available)
```

docker-compose.yml for those who are interested to test it themselves (build Druid images with tag `local`):

Details
This PR has: