
Speed up CI tests by bypassing CPython's difflib for large string comparisons? #1254

Open
pfackeldey opened this issue Jan 23, 2025 · 4 comments


@pfackeldey
Contributor

pfackeldey commented Jan 23, 2025

We may want to think about changing some tests from:

```python
assert large_str1 == large_str2
```

to:

```python
equal = large_str1 == large_str2
assert equal
```

A few tests, e.g. this one: https://github.com/scikit-hep/coffea/blob/master/tests/test_dataset_tools.py#L379-L380, have very large strings to compare.

pytest uses CPython's difflib to create a diff of the left and right values when `assert` is used directly (which is super helpful for small strings). Unfortunately, difflib is pretty slow for large strings, especially ones with only small differences; see pytest-dev/pytest#8998.

I'm not sure if this is always triggered, or only for failing tests. I had to tweak this to make a failing test run locally; otherwise I was stuck in a not-in-my-lifetime-ending for loop in difflib.

(This comes at the cost of losing the diff, but a diff of very large strings isn't readable anyway.)
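The rewrite above can also be wrapped in a small helper that performs the comparison outside the bare `assert`, so pytest's assertion rewriting only ever sees a boolean and never hands the two large strings to difflib (a sketch; `assert_large_equal` is a hypothetical name, not an existing coffea helper):

```python
def assert_large_equal(left: str, right: str) -> None:
    # Perform the comparison outside the bare `assert`, so that on
    # failure pytest sees only a boolean and never invokes difflib
    # on the two large strings to render a diff.
    equal = left == right
    assert equal, "large strings differ (diff suppressed for speed)"
```

Failing tests then report only that the strings differ, which — as noted above — is acceptable, since a diff of very large strings is unreadable anyway.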

@pfackeldey
Contributor Author

Bypassing difflib for these comparisons helps a little, but somehow only for some tests when I run them locally.

Coffea's CI jobs currently take ~30 min, which seems quite long. I'm developing integration tests for awkward, uproot, vector, dask-awkward, and coffea; the CI jobs run all the test suites and take ~35 min in total, of which ~30 min are spent in coffea's tests alone.

Is there a way to speed them up?

@lgray
Collaborator

lgray commented Feb 7, 2025

A lot of the time is spent setting up and tearing down dask clients (and also running jobs in them). We could speed things up by using a single client for the whole test session, and by changing our sample ROOT files to use zstd instead of lzma compression (but maybe not by much...). Then again, perhaps the best compression for these little test files is no compression at all (unless that suddenly makes them no longer little)?
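Sharing one client across the whole run could be done with a session-scoped pytest fixture (a sketch under assumptions: `dask_client` is a hypothetical fixture name, and the `Client` arguments are illustrative, not coffea's actual configuration):

```python
import pytest


@pytest.fixture(scope="session")
def dask_client():
    # One client for the entire test session instead of per-test
    # setup/teardown. distributed is imported lazily so that test
    # collection still works where dask is not installed.
    from distributed import Client

    client = Client(processes=False, n_workers=2, threads_per_worker=1)
    yield client
    client.close()
```

The trade-off of `scope="session"` is that tests share cluster state, so any test that mutates the client or its workers can leak into later tests.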

I wouldn't skip doing these tests because they're largely representative of emergent bugs from integration and use cases, so they're important for catching nasty regressions.

@lgray
Collaborator

lgray commented Feb 7, 2025

I'll profile some of the slower ones today and see if these hypotheses hold.

@pfackeldey
Contributor Author

Thanks!

Not sure if it's useful here, but dask.distributed uses a gen_cluster helper for its own test suite: https://github.com/dask/distributed/blob/main/distributed/utils_test.py#L868. Maybe it already optimizes the setup and teardown time by sharing them somehow?
