
Start to schedule fuzz tasks on batch in OSS-Fuzz #4397

Merged (35 commits, Nov 20, 2024)

Conversation

jonathanmetzman (Collaborator)

Start to schedule fuzz tasks on batch in OSS-Fuzz

The scheduler will work differently in OSS-Fuzz and Chrome. This only implements the OSS-Fuzz version.
This version uses job and project weights to decide which fuzzing jobs to schedule. It then adds these tasks
to the queue for other bots to preprocess, and then for the utask_main_scheduler to actually schedule on batch.
For now, we will only do this for 100 CPUs.

  1. Add a cron job to run the scheduler every 15 minutes.
  2. Improve region handling in batch (still far from complete).
  3. Add function for bulk adding of tasks to queue for use by scheduler.
  4. Make fuzz tasks less of a priority on batch than others.
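As a rough illustration of the weighted selection described above, a scheduler might pick one job per available CPU with probability proportional to the product of job and project weight. The class and field names below are stand-ins for illustration, not the actual ClusterFuzz API:

```python
import random
from dataclasses import dataclass


@dataclass
class FuzzTaskCandidate:
  """Stand-in for a fuzzer/job pairing eligible for scheduling."""
  job: str
  job_weight: float
  project_weight: float

  @property
  def weight(self):
    # A job's effective weight combines its own weight with the
    # weight of the project it belongs to.
    return self.job_weight * self.project_weight


def pick_candidates(candidates, num_cpus):
  """Picks one fuzz task per available CPU, weighted randomly."""
  weights = [c.weight for c in candidates]
  return random.choices(candidates, weights=weights, k=num_cpus)
```

With 100 CPUs, `pick_candidates(candidates, 100)` yields 100 task picks, with heavier projects drawn proportionally more often.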

@jonathanmetzman (Collaborator, Author)

This implementation is bad: I think the job weights are influenced by the project weights. We always run ffmpeg.

@jonathanmetzman jonathanmetzman marked this pull request as draft November 12, 2024 01:19
@jonathanmetzman jonathanmetzman marked this pull request as ready for review November 12, 2024 19:18
@jonathanmetzman (Collaborator, Author)

There are a few temporary contortions we had to make:

  1. We needed to make it possible to run the scheduled fuzz tasks on batch instead of locally, while preserving the ability for most fuzz tasks to run locally.
  2. We need to deal with many, many queues in OSS-Fuzz.

@oliverchang (Collaborator) left a comment:

nice! some initial questions/comments


@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
# Don't use processes on Windows and unittests to avoid hangs.
Collaborator:

Do you know why we hang? is this because of the start method for multiprocessing being "spawn" on windows? or something else?

Collaborator (author):

I'm not sure to be honest. The comment isn't really detailed enough for me to know exactly what happened. My mistake.
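For what it's worth, a common workaround for this class of hang is to fall back to a thread pool on the affected platforms; the "spawn" start method on Windows re-imports modules in every worker, which is a frequent culprit. A minimal sketch (not the actual ClusterFuzz code; `PY_UNITTEST` is a hypothetical env var used here only for illustration):

```python
import contextlib
import os
import sys
from concurrent import futures

POOL_SIZE = 4


def _use_threads():
  # Processes hang on Windows (possibly because the default "spawn"
  # start method re-imports modules in every worker) and in unit
  # tests, so fall back to threads there.
  return sys.platform.startswith('win') or os.getenv('PY_UNITTEST')


@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
  # Don't use processes on Windows and unittests to avoid hangs.
  if _use_threads():
    pool = futures.ThreadPoolExecutor(max_workers=pool_size)
  else:
    pool = futures.ProcessPoolExecutor(max_workers=pool_size)
  try:
    yield pool
  finally:
    pool.shutdown(wait=True)
```

Both executor types share the same `map`/`submit` interface, so callers don't need to know which one they got.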

infra/k8s/schedule-fuzz.yaml (outdated, resolved)
@@ -368,14 +368,19 @@ def __init__(self,
                eta=None,
                is_command_override=False,
                high_end=False,
-               extra_info=None):
+               extra_info=None,
+               is_from_queue=False):
Collaborator:

is this also for testing only? This seems like a rather confusing flag to add here.

Can you also explain a bit why you need this and how it's helpful for testing?

Collaborator (author):

It's for testing fuzzing on batch in prod.
Basically, this PR adds a fuzz task scheduler. But we need a mechanism to ensure these fuzz tasks get scheduled on batch instead of just run by regular bots.
This is only needed for that purpose and I will remove it when we have moved all of fuzzing to batch.
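A hypothetical sketch of the routing this flag enables (none of these names are the real ClusterFuzz API): fuzz tasks that the scheduler pushed onto the queue carry `is_from_queue=True` and are sent to Cloud Batch, while everything else keeps the existing bot path.

```python
class Task:
  """Illustrative stand-in for a queued ClusterFuzz task."""

  def __init__(self, command, argument, job, is_from_queue=False):
    self.command = command
    self.argument = argument
    self.job = job
    # Temporary flag (to be removed once all fuzzing runs on batch):
    # marks fuzz tasks that the scheduler placed on the queue.
    self.is_from_queue = is_from_queue


def should_run_on_batch(task):
  # Only scheduler-created fuzz tasks are diverted to batch; regular
  # bot-claimed tasks run as before.
  return task.command == 'fuzz' and task.is_from_queue
```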

Collaborator:

OK! Please add a TODO here in that case.

Collaborator (author):

Done!

@oliverchang (Collaborator) left a comment:

nice!!

infra/k8s/schedule-fuzz.yaml (resolved)
continue
assert preemptible_quota or cpu_quota

if not preemptible_quota['limit']:
Collaborator:

can you explain in a comment how this logic works? Why can we switch between preemptible and non-preemptible quota?

Collaborator (author):

I'm going to explain, but basically in our Chrome instance (https://pantheon.corp.google.com/iam-admin/quotas?e=-13802955&mods=logs_tg_prod&project=google.com:clusterfuzz&pageState=(%22allQuotasTable%22:(%22f%22:%22%255B%255D%22))) there is no preemptible CPU quota, just a single CPU quota. That means we need to obey the CPU quota when the preemptible CPU quota doesn't exist (i.e. is 0).
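A sketch of how that fallback might look. The quota dicts here assume the Compute Engine quota shape with 'limit' and 'usage' keys; the function itself is illustrative, not the PR's code:

```python
def get_cpu_limit(preemptible_quota, cpu_quota):
  """Returns the number of CPUs available for scheduling.

  Some instances (e.g. the Chrome one) have no separate preemptible
  CPU quota, only a single CPU quota. When the preemptible quota's
  limit is 0 (i.e. it effectively doesn't exist), fall back to the
  ordinary CPU quota.
  """
  assert preemptible_quota or cpu_quota
  if preemptible_quota and preemptible_quota['limit']:
    quota = preemptible_quota
  else:
    quota = cpu_quota
  return quota['limit'] - quota['usage']
```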

Collaborator:

ack, thanks! Clarifying that in a comment would be very helpful for future contributors.

# TODO(metzman): Come up with a better system than this. I think a system where
# we have a list of zones and associated config (such as subnets) to pick where
# to launch tasks is ideal.
region: 'us-central1'
@oliverchang (Collaborator), Nov 14, 2024:

what's using this exactly? I couldn't find it by ctrl-fing 'region' in this PR.

can you also explain a bit why we can't just go with "a system where we have a list of zones and associated config (such as subnets) to pick where to launch tasks" in this PR? Is there anything blocking this?

Collaborator (author):

It's old. I removed it.
There's nothing blocking it, to be honest. But I think it's a bit complex to check multiple regions for CPU usage instead of just one.
I think we can simplify our implementation if we just force all preemptible (fuzz) tasks into us-central1 and all non-preemptible tasks (everything else) into us-east4. We won't need to do anything complicated to pick the region, and we will never starve the non-fuzz tasks (and will have great latency for these). The problems with this approach are:

  1. Cloud might not appreciate upping our CPU usage by 50k in one region and may prefer we split the workload.
  2. We may want to fuzz less if we are doing a lot of other tasks; with the approach above, our spending is uncapped (it's whatever fuzzing capacity is, plus however many non-fuzz tasks are requested).
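The simpler policy floated above could be as small as this (the region names come from the comment; the function itself is hypothetical):

```python
# Preemptible (fuzz) tasks always land in one region, everything else
# in another, so non-fuzz tasks are never starved of quota.
PREEMPTIBLE_REGION = 'us-central1'
NONPREEMPTIBLE_REGION = 'us-east4'


def pick_region(is_preemptible):
  """Picks the region for a task under the fixed split policy."""
  return PREEMPTIBLE_REGION if is_preemptible else NONPREEMPTIBLE_REGION
```

The trade-off, as noted above, is that this caps nothing: total spend becomes fuzzing capacity plus whatever non-fuzz work arrives.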

src/clusterfuzz/_internal/cron/schedule_fuzz.py (outdated, resolved)
# TODO(metzman): Handle high end.
# A job's weight is determined by its own weight and the weight of the
# project it is a part of. First get project weights.
logs.info('Getting projects.')
Collaborator:

Is it more useful to instead log the total weight we computed?

i.e. "Computed total project CPU weight of X from Y projects"

Collaborator (author):

These logs are sort of only for tracking control flow; they can probably be removed now that I know this code works pretty well. Should I?

I don't think the weight is important to log, because I don't think it will be a meaningful number. The total is basically just a denominator, right?
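In other words, the total weight only normalizes: a job's selection probability is its weight over the sum. A one-liner to make that concrete:

```python
def selection_probabilities(weights):
  # The total weight is just a denominator: each job's chance of
  # being picked is its share of the sum.
  total = sum(weights)
  return [w / total for w in weights]
```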

return 2


class FuzzerJob:
Collaborator:

this is a bit confusing. can we rename this something to differentiate it more from data_types.FuzzerJob? Something like "FuzzTaskCandidate" ?

Collaborator (author):

Done. But I'm not really sure it's improved, it's a FuzzerJob joined with some fields from Job that we won't save.

@jonathanmetzman jonathanmetzman merged commit d0092e2 into master Nov 20, 2024
7 checks passed
@jonathanmetzman jonathanmetzman deleted the fuzzschedule branch November 20, 2024 15:44