Add configurable option to scale-down during task run time for cost-based autoscaler#18958
Conversation
e689f4e to
e26868e
Compare
e26868e to
4fd0f86
Compare
capistrant
left a comment
There was a problem hiding this comment.
serde test needs fixing up it seems. in addition to that left some lower stakes comments questions. Overall I think it is mergeable as is, but I do wonder about best default for scaleDownDuringTaskRolloverOnly in the long term
| } else if (!config.isScaleDownOnTaskRolloverOnly() && optimalTaskCount < currentTaskCount && | ||
| (config.getScaleDownBarrier() == 0 || ++scaleDownCounter == config.getScaleDownBarrier())) { |
There was a problem hiding this comment.
nit suggestion: might be easier cognitively to drop the ||?
| } else if (!config.isScaleDownOnTaskRolloverOnly() && optimalTaskCount < currentTaskCount && | |
| (config.getScaleDownBarrier() == 0 || ++scaleDownCounter == config.getScaleDownBarrier())) { | |
| } else if (!config.isScaleDownOnTaskRolloverOnly() && optimalTaskCount < currentTaskCount && ++scaleDownCounter >= config.getScaleDownBarrier()) { |
| |`idleWeight`|The weight of extracted poll idle value in cost function. | No | 0.75 | | ||
| |`defaultProcessingRate`|A planned processing rate per task, required for first cost estimations. | No | 1000 | | ||
| |`scaleDownBarrier`| A number of successful scale down attempts which should be skipped to prevent the auto-scaler from scaling down tasks immediately. | No | 5 | | ||
| |`scaleDownDuringTaskRolloverOnly`| Indicates whether task scaling down is limited to periods during task rollovers only. | No | False | |
There was a problem hiding this comment.
Is it possible that we want this to be true as default, if that is like the production behavior we think will be preferred? I say if that is like the production behavior we think will be preferred because of this in the description:
To test the cost-based autoscaler in real-world scenarios, we temporarily allow scale-downs during task runtime
like instead of temporarily defaulting for testing, default for long term productionalization plans and we can override for folks testing behavior more rapidly?
There was a problem hiding this comment.
Actually, that's the case. Scale down during task rollover is now in unsatisfactory condition, and the existence of that PR with scaleDownDuringTaskRolloverOnly set to false is required for smooth autoscaler testing.
There is a parallel PR: #18954, I tried to integrate a new way to gradually stop tasks and scale down, but it is very risky and need a very careful review and testing.
Well, by bad, I did not produce a good PR description. I updated the it.
| this.idleWeight = Configs.valueOrDefault(idleWeight, DEFAULT_IDLE_WEIGHT); | ||
| this.defaultProcessingRate = Configs.valueOrDefault(defaultProcessingRate, DEFAULT_PROCESSING_RATE); | ||
| this.scaleDownBarrier = Configs.valueOrDefault(scaleDownBarrier, DEFAULT_SCALE_DOWN_BARRIER); | ||
| this.scaleDownDuringTaskRolloverOnly = Configs.valueOrDefault(scaleDownDuringTaskRolloverOnly, false); |
There was a problem hiding this comment.
nit: should this follow pattern of default at top of file or are we only doing that for magic numbers with context added in comments?
There was a problem hiding this comment.
Sorry, did not get your point :)
There was a problem hiding this comment.
Just wondering if inlining the default when the rest are constants at the top of the file for just this one config is intentional. I think it is no big deal
| { | ||
| private static final long DEFAULT_SCALE_ACTION_PERIOD_MILLIS = 15 * 60 * 1000; // 15 minutes | ||
| private static final long DEFAULT_MIN_TRIGGER_SCALE_ACTION_FREQUENCY_MILLIS = 1200000; // 20 minutes | ||
| private static final long DEFAULT_SCALE_ACTION_PERIOD_MILLIS = 10 * 60 * 1000; // 10 minutes |
There was a problem hiding this comment.
looks like this caused the default serde test to fail. I assume the one below would also fail assertions if not fixed up as well.
capistrant
left a comment
There was a problem hiding this comment.
thanks for the fixing up @Fly-Style. LGTM
| Assert.assertEquals(DEFAULT_SCALE_ACTION_PERIOD_MILLIS, config.getScaleActionPeriodMillis()); | ||
| Assert.assertEquals(DEFAULT_MIN_TRIGGER_SCALE_ACTION_FREQUENCY_MILLIS, config.getMinTriggerScaleActionFrequencyMillis()); | ||
| Assert.assertEquals(DEFAULT_LAG_WEIGHT, config.getLagWeight(), 0.001); | ||
| Assert.assertEquals(DEFAULT_IDLE_WEIGHT, config.getIdleWeight(), 0.001); | ||
| Assert.assertEquals(DEFAULT_PROCESSING_RATE, config.getDefaultProcessingRate(), 0.001); | ||
| Assert.assertEquals(DEFAULT_SCALE_DOWN_BARRIER, config.getScaleDownBarrier()); |
There was a problem hiding this comment.
Rather than importing the constants, it would be nice to verify the numerical values directly here, so that when the constants change, we consciously update the tests as well.
| |`idleWeight`|The weight of extracted poll idle value in cost function. | No | 0.75 | | ||
| |`defaultProcessingRate`|A planned processing rate per task, required for first cost estimations. | No | 1000 | | ||
| |`scaleDownBarrier`| A number of successful scale down attempts which should be skipped to prevent the auto-scaler from scaling down tasks immediately. | No | 5 | | ||
| |`scaleDownDuringTaskRolloverOnly`| Indicates whether task scaling down is limited to periods during task rollovers only. | No | False | |
There was a problem hiding this comment.
Thanks for adding this config, it makes sense to have this option. But I agree with @capistrant in that we would want the default of this flag to be true so that scale down happens only on task rollover by default. But we can fix it up once we have addressed the issues with the task rollover bit.
There was a problem hiding this comment.
That's a reason for this option's existence, actually :D
| |`lagWeight`|The weight of extracted lag value in cost function.| No| 0.25| | ||
| |`idleWeight`|The weight of extracted poll idle value in cost function. | No | 0.75 | | ||
| |`defaultProcessingRate`|A planned processing rate per task, required for first cost estimations. | No | 1000 | | ||
| |`scaleDownBarrier`| A number of successful scale down attempts which should be skipped to prevent the auto-scaler from scaling down tasks immediately. | No | 5 | |
There was a problem hiding this comment.
This config does not seem to be aligned with the whole cost-based story.
It is a little counterintuitive as it essentially negates the decision taken by the auto-scaler.
IIUC, this has been put in as a hack until we have fixed the issue with scaling on task rollover.
Please remove this config and we should try to evaluate the following instead:
- increase the window over which we compute the average lag
- AND/OR add more terms to the cost function
- AND/OR add a config like
minScaleDownDelay(which serves a role similar to thescaleDownBarrierbut in terms of time rather than count)
There was a problem hiding this comment.
minScaleDelay idea looks better, I like it more 👍🏻
To test the cost-based autoscaler in real-world scenarios, we temporarily allow scale-downs during task runtime. In case if we don't want to scale down immediately.
Two configuration options has been added:
scaleDownDuringTaskRolloverOnlyindicates if task scaling down is limited to periods during task rollovers only.falseby default, allows scaling down during normal task run time.scaleDownBarrier, A number of successful scale down attempts which should be skipped to allow prevention of the auto-scaler from scaling down tasks immediately when the computed cost-based metrics fall below this barrier. Setting it to0removes the threshold.There is a parallel PR that tries to implement the most correct behaviour for scaling during task rollover: #18954
This PR has: