
Start to schedule fuzz tasks on batch in OSS-Fuzz #4397

Merged (35 commits, Nov 20, 2024)

Conversation

jonathanmetzman (Collaborator)

Start to schedule fuzz tasks on batch in OSS-Fuzz

The scheduler will work differently in OSS-Fuzz and Chrome. This only implements the OSS-Fuzz version.
This version uses job and project weights to decide which fuzzing jobs to schedule. It then adds these tasks
to the queue for other bots to preprocess, and then for the utask_main_scheduler to actually schedule on batch.
For now, we will only do this for 100 CPUs.

  1. Add a cron job to run the scheduler every 15 minutes.
  2. Improve region handling in batch (still far from complete).
  3. Add function for bulk adding of tasks to queue for use by scheduler.
  4. Make fuzz tasks less of a priority on batch than others.
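As a rough illustration of the weighted selection described above, a scheduler might pick one job per available CPU with probability proportional to the product of job and project weight. The class and field names below are stand-ins for illustration, not the actual ClusterFuzz API:

```python
import random
from dataclasses import dataclass


@dataclass
class FuzzTaskCandidate:
  """Stand-in for a fuzzer/job pairing eligible for scheduling."""
  job: str
  job_weight: float
  project_weight: float

  @property
  def weight(self):
    # A job's effective weight combines its own weight with the
    # weight of the project it belongs to.
    return self.job_weight * self.project_weight


def pick_candidates(candidates, num_cpus):
  """Picks one fuzz task per available CPU, weighted randomly."""
  weights = [c.weight for c in candidates]
  return random.choices(candidates, weights=weights, k=num_cpus)
```

With 100 CPUs, `pick_candidates(candidates, 100)` yields 100 task picks, with heavier projects drawn proportionally more often.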

@jonathanmetzman (Collaborator, Author)

This implementation is bad: I think the job weights are influenced by the project weights. We always run ffmpeg.

@jonathanmetzman jonathanmetzman marked this pull request as draft November 12, 2024 01:19
@jonathanmetzman jonathanmetzman marked this pull request as ready for review November 12, 2024 19:18
@jonathanmetzman (Collaborator, Author)

There are a few temporary contortions we had to make:

  1. We needed to make it possible to run the scheduled fuzz tasks on batch instead of locally, while preserving the ability for most fuzz tasks to run locally.
  2. We need to deal with many, many queues in OSS-Fuzz.

@oliverchang (Collaborator) left a comment:

nice! some initial questions/comments


@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
# Don't use processes on Windows and unittests to avoid hangs.
Collaborator:

Do you know why we hang? is this because of the start method for multiprocessing being "spawn" on windows? or something else?

Collaborator (author):

I'm not sure to be honest. The comment isn't really detailed enough for me to know exactly what happened. My mistake.
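For what it's worth, a common workaround for this class of hang is to fall back to a thread pool on the affected platforms; the "spawn" start method on Windows re-imports modules in every worker, which is a frequent culprit. A minimal sketch (not the actual ClusterFuzz code; `PY_UNITTEST` is a hypothetical env var used here only for illustration):

```python
import contextlib
import os
import sys
from concurrent import futures

POOL_SIZE = 4


def _use_threads():
  # Processes hang on Windows (possibly because the default "spawn"
  # start method re-imports modules in every worker) and in unit
  # tests, so fall back to threads there.
  return sys.platform.startswith('win') or os.getenv('PY_UNITTEST')


@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE):
  # Don't use processes on Windows and unittests to avoid hangs.
  if _use_threads():
    pool = futures.ThreadPoolExecutor(max_workers=pool_size)
  else:
    pool = futures.ProcessPoolExecutor(max_workers=pool_size)
  try:
    yield pool
  finally:
    pool.shutdown(wait=True)
```

Both executor types share the same `map`/`submit` interface, so callers don't need to know which one they got.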

infra/k8s/schedule-fuzz.yaml (outdated, resolved)
@@ -368,14 +368,19 @@ def __init__(self,
                eta=None,
                is_command_override=False,
                high_end=False,
-               extra_info=None):
+               extra_info=None,
+               is_from_queue=False):
Collaborator:

is this also for testing only? This seems like a rather confusing flag to add here.

Can you also explain a bit why you need this and how it's helpful for testing?

Collaborator (author):

It's for testing fuzzing on batch in prod.
Basically, this PR adds a fuzz task scheduler. But we need a mechanism to ensure these fuzz tasks get scheduled on batch instead of just run by regular bots.
This is only needed for that purpose and I will remove it when we have moved all of fuzzing to batch.
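A hypothetical sketch of the routing this flag enables (none of these names are the real ClusterFuzz API): fuzz tasks that the scheduler pushed onto the queue carry `is_from_queue=True` and are sent to Cloud Batch, while everything else keeps the existing bot path.

```python
class Task:
  """Illustrative stand-in for a queued ClusterFuzz task."""

  def __init__(self, command, argument, job, is_from_queue=False):
    self.command = command
    self.argument = argument
    self.job = job
    # Temporary flag (to be removed once all fuzzing runs on batch):
    # marks fuzz tasks that the scheduler placed on the queue.
    self.is_from_queue = is_from_queue


def should_run_on_batch(task):
  # Only scheduler-created fuzz tasks are diverted to batch; regular
  # bot-claimed tasks run as before.
  return task.command == 'fuzz' and task.is_from_queue
```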

Collaborator:

OK! Please add a TODO here in that case.

Collaborator (author):

Done!

@oliverchang (Collaborator) left a comment:

nice!!

infra/k8s/schedule-fuzz.yaml (resolved)
continue
assert preemptible_quota or cpu_quota

if not preemptible_quota['limit']:
Collaborator:

can you explain in a comment how this logic works? Why can we switch between preemptible and non-preemptible quota?

Collaborator (author):

I'm going to explain, but basically in our Chrome instance (https://pantheon.corp.google.com/iam-admin/quotas?e=-13802955&mods=logs_tg_prod&project=google.com:clusterfuzz&pageState=(%22allQuotasTable%22:(%22f%22:%22%255B%255D%22))) there is no preemptible CPU quota, just a single CPU quota. That means we need to obey the CPU quota when the preemptible CPU quota doesn't exist (i.e. is 0).
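A sketch of how that fallback might look. The quota dicts here assume the Compute Engine quota shape with 'limit' and 'usage' keys; the function itself is illustrative, not the PR's code:

```python
def get_cpu_limit(preemptible_quota, cpu_quota):
  """Returns the number of CPUs available for scheduling.

  Some instances (e.g. the Chrome one) have no separate preemptible
  CPU quota, only a single CPU quota. When the preemptible quota's
  limit is 0 (i.e. it effectively doesn't exist), fall back to the
  ordinary CPU quota.
  """
  assert preemptible_quota or cpu_quota
  if preemptible_quota and preemptible_quota['limit']:
    quota = preemptible_quota
  else:
    quota = cpu_quota
  return quota['limit'] - quota['usage']
```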

Collaborator:

ack, thanks! Clarifying that in a comment would be very helpful for future contributors.

# TODO(metzman): Come up with a better system than this. I think a system where
# we have a list of zones and associated config (such as subnets) to pick where
# to launch tasks is ideal.
region: 'us-central1'
@oliverchang (Collaborator), Nov 14, 2024:

what's using this exactly? I couldn't find it by ctrl-fing 'region' in this PR.

can you also explain a bit why we can't just go with "a system where we have a list of zones and associated config (such as subnets) to pick where to launch tasks" in this PR? Is there anything blocking this?

Collaborator (author):

It's old. I removed it.
There's nothing blocking it, to be honest. But I think it's a bit complex to check multiple regions for CPU usage instead of just one.
I think we can simplify our implementation if we just force all preemptible (fuzz) tasks into us-central1 and all non-preemptible tasks (everything else) into us-east4. We won't need to do anything complicated to pick the region, and we will never starve the non-fuzz tasks (and will have great latency for these). The problems with this approach are:

  1. Cloud might not appreciate upping our CPU usage by 50k in one region and may prefer we split the workload.
  2. We may want to fuzz less if we are doing a lot of other tasks; with the approach above, our spending is uncapped (it's whatever fuzzing capacity is, plus however many non-fuzz tasks are requested).
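The simpler policy floated above could be as small as this (the region names come from the comment; the function itself is hypothetical):

```python
# Preemptible (fuzz) tasks always land in one region, everything else
# in another, so non-fuzz tasks are never starved of quota.
PREEMPTIBLE_REGION = 'us-central1'
NONPREEMPTIBLE_REGION = 'us-east4'


def pick_region(is_preemptible):
  """Picks the region for a task under the fixed split policy."""
  return PREEMPTIBLE_REGION if is_preemptible else NONPREEMPTIBLE_REGION
```

The trade-off, as noted above, is that this caps nothing: total spend becomes fuzzing capacity plus whatever non-fuzz work arrives.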

src/clusterfuzz/_internal/cron/schedule_fuzz.py (outdated, resolved)
# TODO(metzman): Handle high end.
# A job's weight is determined by its own weight and the weight of the
# project it is a part of. First get project weights.
logs.info('Getting projects.')
Collaborator:

Is it more useful to instead log the total weight we computed?

i.e. "Computed total project CPU weight of X from Y projects"

Collaborator (author):

These logs are sort of only for tracking control flow; they can probably be removed now that I know this code works pretty well. Should I?

I don't think the weight is important to log, because I don't think it will be a meaningful number. The total is basically just a denominator, right?
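In other words, the total weight only normalizes: a job's selection probability is its weight over the sum. A one-liner to make that concrete:

```python
def selection_probabilities(weights):
  # The total weight is just a denominator: each job's chance of
  # being picked is its share of the sum.
  total = sum(weights)
  return [w / total for w in weights]
```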

return 2


class FuzzerJob:
Collaborator:

this is a bit confusing. can we rename this something to differentiate it more from data_types.FuzzerJob? Something like "FuzzTaskCandidate" ?

Collaborator (author):

Done. But I'm not really sure it's improved, it's a FuzzerJob joined with some fields from Job that we won't save.

@jonathanmetzman jonathanmetzman merged commit d0092e2 into master Nov 20, 2024
7 checks passed
@jonathanmetzman jonathanmetzman deleted the fuzzschedule branch November 20, 2024 15:44