
Conversation

Collaborator

@javanlacerda javanlacerda commented Feb 1, 2026

SHOULD BE MERGED AFTER #5150.

This PR updates the logic for scheduling fuzz tasks. Instead of aiming to load the GCP Batch infrastructure based on the available CPUs, it looks only at the preprocess queue size.

By default, it aims to keep the preprocess queue at 10k messages, creating new tasks based on the difference between that target and the current queue size.
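The rule described above can be sketched as follows; the names (`tasks_to_schedule`, `PREPROCESS_TARGET_SIZE`) are illustrative assumptions, not identifiers from the PR:

```python
# Illustrative sketch of the scheduling rule, not the PR's actual code.
PREPROCESS_TARGET_SIZE = 10_000  # target number of messages in the preprocess queue


def tasks_to_schedule(current_queue_size: int,
                      target: int = PREPROCESS_TARGET_SIZE) -> int:
  """Schedules the difference between the target and the current queue size."""
  # Never schedule a negative number of tasks when the queue is over target.
  return max(0, target - current_queue_size)
```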


@javanlacerda javanlacerda marked this pull request as ready for review February 4, 2026 16:51
@jonathanmetzman
Collaborator

I'm not going to do a thorough review but will share context to avoid a production incident.
I think it's likely there will be infinite queuing if #5140 is not landed before this.


-def count_unacked(creds, project_id, subscription_id):
   """Counts the unacked messages in |subscription_id|."""
+def get_queue_size(creds, project_id, subscription_id):
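For readers outside the diff: in Pub/Sub, the "queue size" here is the subscription's backlog of undelivered messages, typically read from the Cloud Monitoring metric `pubsub.googleapis.com/subscription/num_undelivered_messages`. A minimal sketch of the selection logic, with the Monitoring query stubbed out (`fetch_points` is a hypothetical stand-in, not the PR's code):

```python
from typing import Callable, List, Tuple

# (unix timestamp, num_undelivered_messages) samples, newest not necessarily last.
Point = Tuple[float, int]


def get_queue_size(fetch_points: Callable[[], List[Point]]) -> int:
  """Returns the most recent undelivered-message count, or 0 if no data."""
  points = fetch_points()
  if not points:
    return 0
  # Monitoring returns a time series; the newest sample is the queue size.
  return max(points, key=lambda p: p[0])[1]
```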
Collaborator

It's probably worth noting somewhere that the queue size metric is delayed by about 5 minutes.

Collaborator Author

Not sure I got your point.

Collaborator

Sorry, I mean that we should mention the need for delays. If the cron runs too often, it might check the queue size before the metric has actually updated to reflect the jobs it just added. Does this make sense?

Collaborator Author

That's a good point. But the idea is to tune this via feature flags, balancing the target queue size against the frequency of the cron execution.
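One way to picture the balance being discussed (purely illustrative, not code from the PR): the cron should only act on the queue-size metric once the metric can reflect the previous scheduling run.

```python
METRIC_DELAY_SECONDS = 5 * 60  # the queue-size metric lags by roughly 5 minutes


def should_schedule(seconds_since_last_run: float,
                    metric_delay: float = METRIC_DELAY_SECONDS) -> bool:
  """Skips runs that would read a metric predating the last scheduling pass."""
  return seconds_since_last_run >= metric_delay
```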

@jonathanmetzman
Collaborator

I'd run this locally using butler scripts to make sure it behaves nicely.

@javanlacerda
Collaborator Author

javanlacerda commented Feb 4, 2026

I'd run this locally using butler scripts to make sure it behaves nicely.

It's been running in dev for 4 days already :)

@javanlacerda
Collaborator Author

I'm not going to do a thorough review but will share context to avoid a production incident. I think it's likely there will be infinite queuing if #5140 is not landed before this.

It will not, because this should be landed only after #5150. And even though we don't have the job limiter for Batch yet, we can control how many tasks are forwarded there via the RemoteTaskGate frequencies.

@@ -223,7 +167,7 @@ def get_fuzz_tasks(self) -> Dict[str, tasks.Task]:
weights.append(fuzz_task_candidate.weight)

# TODO(metzman): Handle high-end jobs correctly.
Contributor

@decoNR decoNR Feb 4, 2026

Just for clarification, how do these new changes relate to this comment?

Collaborator Author

I don't know what @jonathanmetzman meant here, but AFAIK we don't support high-end jobs anymore.


 weights = [candidate.weight for candidate in fuzz_task_candidates]
-num_instances = int(self.num_cpus / self._get_cpus_per_fuzz_job(None))
+num_instances = self.num_tasks
Contributor

@decoNR decoNR Feb 4, 2026

Maybe change this variable name?

Collaborator Author

Moving it to fuzz_tasks.
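For context on this diff: the renamed count drives a weighted draw over the fuzz task candidates. A hedged sketch using `random.choices` (the scheduler's actual sampler may differ; the function name here is an assumption):

```python
import random


def pick_fuzz_tasks(candidates, weights, fuzz_tasks, seed=None):
  """Draws `fuzz_tasks` candidates with probability proportional to weight."""
  rng = random.Random(seed)
  # choices() samples with replacement, so heavy candidates can repeat.
  return rng.choices(candidates, weights=weights, k=fuzz_tasks)
```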

subconf['name'] for subconf in batch_config.get(
'mapping.LINUX-PREEMPTIBLE-UNPRIVILEGED.subconfigs')
}
PREPROCESS_TARGET_SIZE_DEFAULT = 10000
Contributor

I think it would be easier to read if this were at the top of the file. Is there a specific reason for it to be here?

Collaborator Author

It should be at the top. Moving it.


 # TODO(metzman): Handle high-end jobs correctly.
-num_instances = int(self.num_cpus / self._get_cpus_per_fuzz_job(None))
+num_instances = self.num_tasks
Contributor

@decoNR decoNR Feb 4, 2026

ditto.

Collaborator Author

It sounds good. I'll move it to fuzz_tasks

conf = local_config.ProjectConfig()
max_cpus_per_schedule = conf.get('max_cpus_per_schedule')
if max_cpus_per_schedule:
  max_tasks = int(max_cpus_per_schedule / CPUS_PER_FUZZ_JOB)
Contributor

Can this be deprecated? It appears to be used only here.

Collaborator Author

I think so. I'll remove it


@jonathanmetzman
Collaborator

jonathanmetzman commented Feb 4, 2026

I'm not going to do a thorough review but will share context to avoid a production incident. I think it's likely there will be infinite queuing if #5140 is not landed before this.

Will not because this should be landed only after #5150. And even we don't have the job limiter for batch yet, we can control how many tasks will be forward to there by the RemoteTaskGate frequencies.

Great. I haven't had a chance to review/understand that PR, but are you relying on sending everything to k8s to avoid infinite queuing in Batch? Otherwise I don't see how we would ever know whether Batch is full or queuing without querying the Batch API. I trust you!


Signed-off-by: Javan Lacerda <javanlacerda@google.com>

fixes

Signed-off-by: Javan Lacerda <javanlacerda@google.com>
@javanlacerda javanlacerda force-pushed the javan.schedule-fuzz-queue-size branch from c58e4bb to ffb5cc7 on February 4, 2026 20:47