Start to schedule fuzz tasks on batch in OSS-Fuzz #4397
Conversation
This implementation is bad; I think the job weights are influenced by the project weights. We always run ffmpeg.
There are a few contortions we had to do that are temporary:
nice! some initial questions/comments
@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
  # Don't use processes on Windows and unittests to avoid hangs.
Do you know why we hang? Is this because of the start method for multiprocessing being "spawn" on Windows, or something else?
I'm not sure to be honest. The comment isn't really detailed enough for me to know exactly what happened. My mistake.
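For what it's worth, a minimal sketch of the kind of fallback this context manager appears to be doing, switching to a thread-backed pool to dodge the hangs discussed above (the real POOL_SIZE value and the unittest check are assumptions and are simplified here):

```python
import contextlib
import multiprocessing
import multiprocessing.dummy
import sys

POOL_SIZE = 8  # Placeholder; the real constant is defined in the module.


@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
  # Fall back to a thread-backed pool on Windows to avoid the hangs noted
  # above (the real code reportedly also does this for unittests).
  # multiprocessing.dummy is backed by threads, so it sidesteps whatever the
  # process-based pool trips over there (possibly the "spawn" start method).
  if sys.platform.startswith('win'):
    pool = multiprocessing.dummy.Pool(pool_size)
  else:
    pool = multiprocessing.Pool(pool_size)
  try:
    yield pool
  finally:
    pool.close()
    pool.join()
```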
@@ -368,14 +368,19 @@ def __init__(self,
                eta=None,
                is_command_override=False,
                high_end=False,
-               extra_info=None):
+               extra_info=None,
+               is_from_queue=False):
Is this also for testing only? This seems like a rather confusing flag to add here.
Can you also explain a bit why you need this and how it's helpful for testing?
It's for testing fuzzing on batch in prod.
Basically, this PR adds a fuzz task scheduler. But we need a mechanism to ensure these fuzz tasks get scheduled on batch instead of just run by regular bots.
This is only needed for that purpose and I will remove it when we have moved all of fuzzing to batch.
OK! Please add a TODO here in that case.
Done!
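For context, a minimal sketch of how the flag and the requested TODO might read in the constructor (the surrounding class and the other field handling are assumed from the diff above, not the actual code):

```python
class Task:
  """Illustrative stand-in for the task class touched by the diff above."""

  def __init__(self,
               eta=None,
               is_command_override=False,
               high_end=False,
               extra_info=None,
               is_from_queue=False):
    # TODO(metzman): Remove is_from_queue once all fuzzing has moved to batch.
    # It marks tasks pulled from the scheduling queue so they are routed to
    # batch instead of being picked up by regular bots.
    self.eta = eta
    self.is_command_override = is_command_override
    self.high_end = high_end
    self.extra_info = extra_info
    self.is_from_queue = is_from_queue
```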
a714059 to f4df493
nice!!
      continue
    assert preemptible_quota or cpu_quota

    if not preemptible_quota['limit']:
Can you explain in a comment how this logic works? Why can we switch between preemptible and non-preemptible quota?
I'm going to explain, but basically in our Chrome instance (https://pantheon.corp.google.com/iam-admin/quotas?e=-13802955&mods=logs_tg_prod&project=google.com:clusterfuzz&pageState=(%22allQuotasTable%22:(%22f%22:%22%255B%255D%22))) there is no preemptible CPU quota, just one CPU quota. That means we need to obey the CPU quota when the preemptible CPU quota doesn't exist (i.e. is 0).
ack, thanks! Clarifying that in a comment would be very helpful for future contributors.
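To make the requested comment concrete, a sketch of the fallback being described (the helper name and the dict shape are assumptions based on the snippet above, not the actual implementation):

```python
def get_usable_quota(preemptible_quota, cpu_quota):
  """Pick which quota to obey when scheduling preemptible VMs.

  Some instances (e.g. the Chrome one) have no separate preemptible CPU
  quota, only a single CPU quota. In that case the preemptible limit is 0
  and preemptible VMs count against the regular CPU quota, so fall back to
  it; otherwise use the dedicated preemptible quota.
  """
  assert preemptible_quota or cpu_quota
  if not preemptible_quota['limit']:
    # No dedicated preemptible quota exists; obey the ordinary CPU quota.
    return cpu_quota
  return preemptible_quota
```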
configs/test/batch/batch.yaml (Outdated)
# TODO(metzman): Come up with a better system than this. I think a system where
# we have a list of zones and associated config (such as subnets) to pick where
# to launch tasks is ideal.
region: 'us-central1'
What's using this exactly? I couldn't find it by ctrl-F-ing 'region' in this PR.
Can you also explain a bit why we can't just go with "a system where we have a list of zones and associated config (such as subnets) to pick where to launch tasks" in this PR? Is there anything blocking this?
It's old. I removed it.
There's nothing blocking it to be honest. But I think it's a bit complex to check multiple regions for CPU usage instead of just one.
I think we can simplify our implementation if we just force all preemptible (fuzz) tasks into us-central1 and all non-preemptible tasks (everything else) into us-east4. We won't need to do anything complicated to pick the region and we will never starve the non-fuzz tasks (and will have great latency for these). The problems with this approach are:
- Cloud might not appreciate us upping our CPU usage by 50k in one region and may prefer we split the workload.
- We may want to fuzz less if we are doing a lot of other tasks; the approach above means our spending is uncapped (it's whatever fuzzing capacity is, plus however many non-fuzz tasks are requested).
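For illustration, the simpler scheme floated above could look something like this (the helper and its callers are hypothetical; only the region names come from the comment):

```python
def pick_region(is_preemptible):
  # Proposed simplification: every preemptible (fuzz) task goes to one region
  # and everything else goes to another, so no per-region CPU accounting or
  # region-picking logic is needed. The trade-offs (concentrating ~50k CPUs in
  # one region, uncapped fuzzing spend) are the ones listed above.
  return 'us-central1' if is_preemptible else 'us-east4'
```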
  # TODO(metzman): Handle high end.
  # A job's weight is determined by its own weight and the weight of the
  # project it is a part of. First get project weights.
  logs.info('Getting projects.')
Is it more useful to instead log the total weight we computed?
i.e. "Computed total project CPU weight of X from Y projects"
These logs are sort of only for tracking control flow; they probably can be removed now that I know this code works pretty well. Should I?
I don't think the weight thing is important to log because I don't think it will be a meaningful number. The total is just a denominator, basically, right?
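For concreteness, a sketch of how the job and project weights could combine, with the total acting only as the normalizing denominator (the helper and attribute names are illustrative, not the actual implementation):

```python
def compute_selection_probabilities(candidates, project_weights):
  # Each candidate's effective weight is its own job weight scaled by the
  # weight of the project it belongs to. The sum of all effective weights is
  # only used as the denominator when normalizing to probabilities.
  effective = [
      c.weight * project_weights.get(c.project, 1.0) for c in candidates
  ]
  total = sum(effective)
  return [weight / total for weight in effective]
```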
    return 2


class FuzzerJob:
This is a bit confusing. Can we rename this to something that differentiates it more from data_types.FuzzerJob? Something like "FuzzTaskCandidate"?
Done. But I'm not really sure it's improved; it's a FuzzerJob joined with some fields from Job that we won't save.
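For reference, a sketch of what the renamed class might hold, going by the description above (the field names are guesses, not the real attributes):

```python
class FuzzTaskCandidate:
  """A data_types.FuzzerJob joined with a few fields copied from its Job.

  Held in memory only while scheduling; never written back to the datastore.
  """

  def __init__(self, job, fuzzer, project, weight, platform=None):
    self.job = job
    self.fuzzer = fuzzer
    self.project = project
    self.weight = weight
    self.platform = platform
```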
Start to schedule fuzz tasks on batch in OSS-Fuzz
The scheduler will work differently in OSS-Fuzz and Chrome. This only implements the OSS-Fuzz version.
This version will use job and project weights to decide which fuzzing jobs to schedule. It then adds these tasks to the queue for other bots to preprocess and then for the utask_main scheduler to actually schedule on batch.
For now, we will only do this for 100 CPUs.
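Putting the description together, a rough sketch of the selection loop it implies (the constants, queue API, and CPU accounting are illustrative assumptions, not the actual code):

```python
import random

NUM_CPUS = 100     # "For now, we will only do this for 100 CPUs."
CPUS_PER_TASK = 1  # Assumed for illustration.


def schedule_fuzz_tasks(candidates, probabilities, queue):
  # Pick fuzz jobs weighted-randomly until the CPU budget is spent, then push
  # each as a fuzz task onto the queue for other bots to preprocess and for
  # the utask_main scheduler to run on batch.
  available_cpus = NUM_CPUS
  while available_cpus >= CPUS_PER_TASK:
    candidate = random.choices(candidates, weights=probabilities, k=1)[0]
    queue.add_task('fuzz', candidate.fuzzer, candidate.job)  # hypothetical API
    available_cpus -= CPUS_PER_TASK
```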