Possible memory leak #168

Open
chimaerase opened this issue Jan 29, 2025 · 5 comments

@chimaerase

chimaerase commented Jan 29, 2025

I've been using TPOTEstimator to evolve hyperparameters for my ML models, and it appears there may be a memory leak in the TPOT code. My WIP code & data are closed-source, so I won't be able to prove that definitively, but I'm filing a report since this might still be useful information.

Related + IMO confusing TPOT code

In a quick skim through the TPOT code, I noticed this code block, with two near-identically named data members that could easily be confused with each other and cause a memory leak. Regardless of whether my example is actually a TPOT memory leak, I'd suggest using one client variable and a separate bool to track whether or not it was user-provided. These easily confused variables seem likely to be a source of future bugs, if not the cause of a current one.
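
For illustration, a minimal sketch of what I mean (the attribute names here are hypothetical, not taken from the TPOT source):

    from dask.distributed import Client

    class SomeEstimator:
        def __init__(self, client=None):
            # One attribute for the client, plus a bool recording whether we own it.
            self._owns_client = client is None
            self._client = Client(processes=True) if client is None else client

        def _cleanup(self):
            # Only shut down a client we created ourselves; never a user-provided one.
            if self._owns_client:
                self._client.close()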


My (imperfect) example

I'm using 10 cores and my training data has shape (312_735, 240). Using TPOT2 0.1.9a0.

I'm attaching some runtime output (2025-01-29 Likely memory leak.txt), which is a little noisy, but my script doesn't do much other than use TPOT before failing (in this case). Many other tests with no code edits but different data & TPOT versions have worked. Some relevant log messages match up with my recipe below:

  • "Done searching Logistic Regression hyperparameter space in 1 h, 1 min, 42.00 s" -- indicates TPOT fit() has finished.
  • "Cross-validating the Logistic Regression model" -- causes an error immediately after in this case.

My Code

My WIP code basically does this:

  1. Load training data
  2. Preprocess it to drop some columns
  3. Run hyperparameter optimization using TPOT
    Some worker restart messages don't build confidence, but this step completes.
  4. Print the TPOT estimator (see "TPOTEstimator vars" in the attachment)
  5. Run cross-validation again
    I know TPOT already does cross-validation internally; this step accounts for a different, optional code path where TPOT isn't involved, and in my case it isn't too expensive to run again (see the sketch after the code block below).
  6. Fail when trying to allocate new processes
    I'll look into providing my own Dask client to TPOT to work around this, but I think the potential for a TPOT memory leak is still worth reporting.

The TPOTEstimator portion of the code is:

    estimator = TPOTEstimator(
        classification=True,
        cv=cross_validator,
        generations=search_config.generations,
        n_jobs=search_config.n_jobs,
        population_size=search_config.population_size,
        random_state=child_seed,
        early_stop=search_config.early_stop,
        search_space=search_space,
        scorers=["f1"],  
        scorers_weights=[1], 
        scorers_early_stop_tol=search_config.early_stop_tolerance,
        verbose=4,
    )
    estimator.fit(X, y)  # Use genetic algorithm to explore hyperparameter space
    duration = utils.runtime_summary_str(start, datetime.now(UTC))
    print(f"Done searching {model_name} hyperparameter space in {duration}")
    print("TPOTEstimator vars:")
@perib
Collaborator

perib commented Jan 29, 2025

Is your dataset large?

I've run into a similar issue. My working theory is that when a dataset is large and/or one of the TPOT models includes a transformer that blows up the memory further (like polynomialtransformer), the dask future crashes. When the future crashes, for some reason it's not freeing its memory. I've tried to address this in a recent PR #160 by manually forcing garbage collection every now and then, which seemed to help but did not completely resolve the issue.
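
For reference, the workaround in that PR boils down to periodic worker-side garbage collection, roughly like this (a simplified sketch, not the exact code in #160):

    import gc
    from dask.distributed import Client

    client = Client()  # or the client TPOT is already using

    # Ask every worker to run a full garbage-collection pass; this helps reclaim
    # memory that crashed futures leave behind.
    client.run(gc.collect)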

Ideally, when a future goes over the memory limit, dask/TPOT would just cancel that pipeline evaluation. In practice, I think the memory blows up too quickly in a single operation for dask to catch it, which causes the future to crash and may impact other futures while things are resolved. If all futures crash, then the whole thing fails (partly due to how data is scattered with dask).

Some suggestions:

  • Using a linear search space, such as the one found here, helps as well. Search spaces that allow several transformers to be applied in sequence (like dynamiclinearpipeline or graphpipeline) increase the chances of a poor sequence blowing up the size of the dataset and causing memory issues.
  • Another recommendation would be to set the memory limit parameter (see the sketch after this list). In theory this should let dask cancel the future before it crashes if memory explodes. In practice, what usually happens is that a single transformation operation pushes memory past system memory immediately, before dask has a chance to notice and cleanly cancel the future.
  • Lowering the number of parallel threads (n_jobs) may help as well, though since it's usually a single thread going crazy, I'm not sure how much of a difference it would make.
  • There are some lines in the code with a commented-out # client.run(gc.collect). These were not included in the latest update because I didn't have time to thoroughly test them, but you could try uncommenting them to increase how often dask attempts to reclaim the memory lost to crashed futures. When eyeballing the dask dashboard, this does seem to help as well. At least on my machine, the tests on the GitHub page failed when those were included, possibly due to the added runtime.
  • Are you running multiple different instances of TPOT, or running TPOT sequentially? Perhaps having multiple dask clients running simultaneously is causing an issue?
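
A rough sketch of the memory-limit / own-client ideas above (parameter names, including the client argument, are from memory and may differ in your TPOT version; the dask calls are standard):

    import tpot2  # package name for TPOT2; may differ by version
    from dask.distributed import Client, LocalCluster

    # Cap each worker's memory so dask can kill and restart a runaway worker
    # instead of letting a single transformation exhaust system memory.
    cluster = LocalCluster(n_workers=5, threads_per_worker=1, memory_limit="1GB")
    client = Client(cluster)

    estimator = tpot2.TPOTEstimator(
        classification=True,
        search_space=search_space,  # your existing (ideally linear) search space
        scorers=["f1"],
        scorers_weights=[1],
        client=client,              # hand TPOT the pre-configured client (assumed parameter name)
        verbose=4,
    )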

The most recent update to main contains the fixes I mentioned above (not sure if it has been pushed to conda/pip yet). I would also recommend trying the latest version of main on GitHub to see if your performance is more stable.

My PR #160 contains some more examples that can reproduce this issue.

@chimaerase
Author

Thanks for your quick reply and for the suggestions! I'll try some of them; some more context on my previous run is below in case it's helpful.

I'm only running one instance of TPOT, so it's likely the sole consumer of the system resources that are depleted after it's done. I notice you haven't directly replied yet re: the confusing variable names for the Dask client, which I'd still flag as a significant maintenance risk even if it isn't the cause of this.

My data

My data is fairly small, though I'm not clear on the different ways it might be processed and transformed internally by scikit-learn and TPOT. This is categorical data encoded using dummy variables, so it's all booleans. I'm using 10 cores/jobs (× 76 MB = 760 MB), so the 12 GB of memory I bumped this Docker container to should presumably be way more than enough.

>>> df.shape
(312735, 240)
>>> size_bytes = df.memory_usage(deep=True).sum()
>>> size_mb = size_bytes / (1024 * 1024)
>>> size_mb
76.05307102203369

Environment

I was able to capture the memory and CPU traces from Docker after the fact (still corresponding to the run from my previous attachment), and I'm including them here. I've seen explicit errors from crashed futures in previous runs with less memory, but those don't appear in my attachment, only warnings about restarting workers.

[Image: Docker memory and CPU traces for the run described above]

@perib
Collaborator

perib commented Feb 4, 2025

I agree about the variable name, though I don't think the way the client variable is tracked is what's causing this particular issue.

Just to clarify, is TPOT completely crashing for you? Or does it still return an output?

Can you check est.evaluated_individuals? One issue I have noticed is that at some point the dask client crashes and can't resume, and all subsequent pipelines fail. Since TPOT doesn't know the dask client itself has failed, it simply marks all following pipelines as failed, thinking the error is the pipeline's fault. Are you noticing that in your runs?
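
Something along these lines (a rough sketch; evaluated_individuals is a pandas DataFrame, but which columns record scores/errors depends on your version, so the inspection below is only illustrative):

    # est is the fitted TPOTEstimator from your script.
    evaluated = est.evaluated_individuals  # DataFrame with one row per pipeline tried

    print(evaluated.shape)
    # If the dask client died partway through, you would typically see a long tail
    # of rows with missing scores (or recorded errors) after some point.
    print(evaluated.isna().sum())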

It's possible that (at least some) scikit-learn methods must convert the booleans to floats (at least for intermediate results), so the true size of the dataset could be 572 MB if all variables need to be represented as floats (e.g., MLP would probably convert everything to floats). It's possible that a smaller dataset or more RAM (if possible) would resolve this.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.choice([0.0, 1.0], size=(312735, 240)))
size_bytes = df.memory_usage(deep=True).sum()
size_mb = size_bytes / (1024 * 1024)
print(size_mb)  # ~572

@chimaerase
Author

Thanks again for the quick reply!

Sometimes TPOT is crashing, but in my original report and several similar runs, it completed. Then a subsequent attempt to run basic scikit-learn code for cross-validation crashed due to lack of resources.

Still, at a minimum of 572 MB per worker × 10 workers, 12 GB would presumably be enough. I'm trying to run again with 5 cores instead of 10, which should roughly halve the memory use. I've seen log messages from TPOT in earlier runs that seemed to indicate what you describe, with many pipelines failing, though none of those log messages appeared here.

I'll try to reply again with feedback as I'm able, but unfortunately time is limited on this project and runs take a long time. I probably can't dedicate much time to debugging this, and may need to change frameworks if that's the fastest approach. At the least, I'll try to provide whatever feedback I can.

@chimaerase
Author

OK, it looks like reducing the number of cores per your suggestion worked. Thank you!

I was able to optimize hyperparameters for 3 different models in sequence in a single script without crashing, though there were some scattered timed-out / killed worker processes. In some cases there were also many generations without improvement, so I'll have to look into whether those are real or perhaps a less-noticeable side effect of memory-related errors.

This all seems to support your hypothesis about what's happening.

A random assortment of related error messages from one portion of the run:

2025-02-05 01:12:18,787 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('eval_objective_list-feadd290d16e8a510814e977fbcf97da')" coro=<Worker.execute() done, defined at /home/z-user/.venv/lib/python3.11/site-packages/distributed/worker_state_machine.py:3606>> ended with CancelledError
2025-02-05 01:12:18,787 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('eval_objective_list-36b29cb6e624f159e94d7582b9115dd3')" coro=<Worker.execute() done, defined at /home/z-user/.venv/lib/python3.11/site-packages/distributed/worker_state_machine.py:3606>> ended with CancelledError
2025-02-05 01:12:18,787 - distributed.worker.state_machine - WARNING - Async instruction for <Task cancelled name="execute('eval_objective_list-1e68a61ed9b54153fb386a684928764e')" coro=<Worker.execute() done, defined at /home/z-user/.venv/lib/python3.11/site-packages/distributed/worker_state_machine.py:3606>> ended with CancelledError
2025-02-05 01:12:18,789 - distributed.scheduler - ERROR - Removing worker 'tcp://127.0.0.1:45473' caused the cluster to lose scattered data, which can't be recovered: {'DataFrame-3134d51a0b1eceba930ffd3057eb7a0a', 'Series-96c12667ead5d0cdfc23e86a401e3ec2'} (stimulus_id='handle-worker-cleanup-1738717938.7891097')
2025-02-05 01:12:22,787 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2025-02-05 01:12:22,788 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2025-02-05 01:12:22,788 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
