Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fuzzing: ClusterFuzz integration #7079

Open
wants to merge 78 commits into
base: main
Choose a base branch
from

Conversation

kripken
Copy link
Member

@kripken kripken commented Nov 14, 2024

The main addition here is a bundle_clusterfuzz.py script which will package up
the exact files that should be uploaded to ClusterFuzz. It also documents the
process and bundling and testing. You can do

bundle.py OUTPUT_FILE.tgz

That bundles wasm-opt from ./bin., which is enough for local testing. For
actually uploading to ClusterFuzz, we need a portable build, and @dschuff
had the idea to reuse the emsdk build, which works nicely. Doing

bundle.py OUTPUT_FILE.tgz --build-dir=/path/to/emsdk/upstream/

will bundle wasm-opt (+libs) from the emsdk. I verified that those builds
work on ClusterFuzz.

I added several forms of testing here. First, our main fuzzer fuzz_opt.py now
has a ClusterFuzz testcase handler, which simulates a ClusterFuzz environment.
Second, there are smoke tests that run in the unit test suite, and can also be
run separately:

python -m unittest test/unit/test_cluster_fuzz.py

Those unit tests can also run on a given bundle, e.g. one created from an
emsdk build, for testing right before upload:

BINARYEN_CLUSTER_FUZZ_BUNDLE=/path/to/bundle.tgz python -m unittest test/unit/test_cluster_fuzz.py

A third piece of testing is to add a --fuzz-passes test. That is a mode for
-ttf (translate random data into a valid wasm fuzz testcase) that uses random
data to pick and run a set of passes, to further shape the wasm. (--fuzz-passes
had no previous testing, and this PR fixes it and tidies it up a little, adding some
newer passes too).

Otherwise this PR includes the key run.py script that is bundled and then
executed by ClusterFuzz, basically a python script that runs wasm-opt -ttf [..]
to generate testcases, sets up their JS, and emits them.

fuzz_shell.js, which is the JS to execute testcases, will now check if it is
provided binary data of a wasm file. If so, it does not read a wasm file from
argv[1]. (This is needed because ClusterFuzz expects a single file for the
testcase, so we make a JS file with bundled wasm inside it.)

@@ -85,7 +85,7 @@
# Delete the argument, as importing |shared| scans it.
sys.argv.pop()

from test import shared
from test import shared # noqa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor the shared argument parsing to use less global state so we don't have to dodge the linter like this?


./emsdk install tot

after which ./upstream/ (from the emsdk dir) will contain portable builds of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "portable" mean in this context?


2. Run the unit tests, which include smoke tests for our ClusterFuzz support:

python -m unittest test/unit/test_cluster_fuzz.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this script should run these smoke tests automatically?

Comment on lines +111 to +112
input_data_file_path = os.path.join(output_dir, '%d.input' % i)
wasm_file_path = os.path.join(output_dir, '%d.wasm' % i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
input_data_file_path = os.path.join(output_dir, '%d.input' % i)
wasm_file_path = os.path.join(output_dir, '%d.wasm' % i)
input_data_file_path = os.path.join(output_dir, f'{i}.input')
wasm_file_path = os.path.join(output_dir, f'{i}.wasm')

Comment on lines +114 to +115
# wasm-opt may fail to run in rare cases (when the fuzzer emits code it
# detects as invalid). Just try again in such a case.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I've never seen this before. Should we put the retry logic in wasm-opt instead?

for i in range(0, N + 2):
fuzz_file = os.path.join(temp_dir.name, f'fuzz-binaryen-{i}.js')
flags_file = os.path.join(temp_dir.name, f'flags-binaryen-{i}.js')
# We actually emit the range [1, N], so 0 or N+1 should not exist.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a bit simpler and just emit files starting with 0?

#
# StructNew : 18
#
struct_news_regex = re.compile(r'StructNew(\s+):(\s+)(\d+)')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the parens for readability? They shouldn't be necessary IIUC.

shared.WASM_OPT + ['-all', '--metrics', binary_file, '-q'], text=True)

# Update with what we see.
struct_news = re.findall(struct_news_regex, metrics)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why findall? Don't we expect there to be only one StructNew line?

# Update with what we see.
struct_news = re.findall(struct_news_regex, metrics)
if not struct_news:
# No line is emitted when --metrics seens no struct.news.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# No line is emitted when --metrics seens no struct.news.
# No line is emitted when --metrics sees no struct.news.

if not struct_news:
# No line is emitted when --metrics seens no struct.news.
struct_news = [('', '', '0')]
seen_struct_news.append(int(struct_news[0][2]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can used a named group instead of a plain index to make this more robust. (I'm surprised 2 is the right group index. Shouldn't it be 3 because group 0 is the entire match?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants