
Speed up CI tests by bypassing CPython's difflib for large string comparisons? #1254

Open
pfackeldey opened this issue Jan 23, 2025 · 4 comments


@pfackeldey
Contributor

pfackeldey commented Jan 23, 2025

We may want to think about changing some tests from:

```python
assert large_str1 == large_str2
```

to:

```python
equal = large_str1 == large_str2
assert equal
```

A few tests, e.g. this one: https://github.com/scikit-hep/coffea/blob/master/tests/test_dataset_tools.py#L379-L380, have very large strings to compare.

pytest uses CPython's difflib to create a diff of the left and right values when `assert` is used directly (which is super helpful for small strings). Unfortunately, difflib is pretty slow for large strings, especially ones with only small differences; see pytest-dev/pytest#8998.

I'm not sure if this is always triggered, or only for failing tests. I had to tweak this to make a failing test run locally; otherwise I was stuck in a not-in-my-lifetime-ending for loop in difflib.

(This comes at the cost of losing the diff, but a diff of very large strings isn't readable anyway.)
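The rewrite above can also be wrapped in a small helper that performs the comparison outside the bare `assert`, so pytest's assertion rewriting only ever sees a boolean and never hands the two large strings to difflib (a sketch; `assert_large_equal` is a hypothetical name, not an existing coffea helper):

```python
def assert_large_equal(left: str, right: str) -> None:
    # Perform the comparison outside the bare `assert`, so that on
    # failure pytest sees only a boolean and never invokes difflib
    # on the two large strings to render a diff.
    equal = left == right
    assert equal, "large strings differ (diff suppressed for speed)"
```

Failing tests then report only that the strings differ, which — as noted above — is acceptable, since a diff of very large strings is unreadable anyway.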

@pfackeldey
Contributor Author

Bypassing difflib for these comparisons helps a little, but somehow only for some tests when I run them locally.

Coffea's CI jobs currently take ~30 min, which seems quite long. I'm developing integration tests for awkward, uproot, vector, dask-awkward, and coffea; the CI jobs run all the test suites and take ~35 min in total, of which ~30 min are spent in coffea's tests alone.

Is there a way to speed them up?

@lgray
Collaborator

lgray commented Feb 7, 2025

A lot of the time is spent setting up and tearing down dask clients (and also running jobs in them). We could speed things up by using a single client for the whole test session, and by changing our sample ROOT files to use zstd instead of lzma compression (but maybe not by much...). Then again, perhaps the best compression for these little test files is no compression at all (unless that suddenly makes them no longer little)?
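Sharing one client across the whole run could be done with a session-scoped pytest fixture (a sketch under assumptions: `dask_client` is a hypothetical fixture name, and the `Client` arguments are illustrative, not coffea's actual configuration):

```python
import pytest


@pytest.fixture(scope="session")
def dask_client():
    # One client for the entire test session instead of per-test
    # setup/teardown. distributed is imported lazily so that test
    # collection still works where dask is not installed.
    from distributed import Client

    client = Client(processes=False, n_workers=2, threads_per_worker=1)
    yield client
    client.close()
```

The trade-off of `scope="session"` is that tests share cluster state, so any test that mutates the client or its workers can leak into later tests.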

I wouldn't skip doing these tests because they're largely representative of emergent bugs from integration and use cases, so they're important for catching nasty regressions.

@lgray
Collaborator

lgray commented Feb 7, 2025

I'll profile some of the slower ones today and see if these hypotheses hold.

@pfackeldey
Contributor Author

Thanks!

Not sure if it's useful here, but dask.distributed uses a gen_cluster helper for its own test suite: https://github.com/dask/distributed/blob/main/distributed/utils_test.py#L868. Maybe it already optimizes the setup and teardown time by sharing them somehow?
