Possible memory leak #168
Is your dataset large? I've run into a similar issue. My working theory is that when a dataset is large and/or one of the TPOT models has a transformer that blows up the memory further (like a polynomial transformer), it can cause the Dask future to crash. When the future crashes, for some reason it doesn't free its memory. I tried to address this in a recent PR #160 by manually forcing garbage collection every now and then, which seemed to help but didn't completely resolve the issue.

Ideally, when a future goes over the memory limit, Dask/TPOT would just cancel that pipeline evaluation. In practice, I think the memory blows up too quickly in a single operation for Dask to catch it, which causes the future to crash and may impact other futures simultaneously while it resolves. If all futures crash, then the whole thing fails (partly due to how data is scattered with Dask). Some suggestions:
The most recent update to main contains the fixes I mentioned above (not sure if it was pushed to conda/pip yet). I would also recommend trying the latest main on GitHub to see if your performance is more stable. My PR #160 contains some more examples that can reproduce this issue.
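The periodic garbage-collection workaround described above (from PR #160) can be sketched roughly like this; the function and argument names here are illustrative, not TPOT's actual internals:

```python
import gc

def evaluate_with_periodic_gc(pipelines, evaluate, gc_every=10):
    """Evaluate pipelines, forcing a garbage-collection pass every
    `gc_every` evaluations to reclaim memory left behind by crashed
    or completed evaluations (hypothetical helper, for illustration)."""
    results = []
    for i, pipeline in enumerate(pipelines, start=1):
        results.append(evaluate(pipeline))
        if i % gc_every == 0:
            gc.collect()  # manual collection between batches of evaluations
    return results

# Toy usage: "evaluating" a pipeline is just squaring a number here.
print(evaluate_with_periodic_gc(range(5), lambda x: x * x, gc_every=2))
# -> [0, 1, 4, 9, 16]
```

Forcing collection only mitigates the problem: it reclaims objects Python can see, but cannot undo a single allocation that overshoots the worker's memory limit mid-operation.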
Thanks for your quick reply, and for the suggestions! I'll try some of them; some more context below on my previous run in case it's helpful. I'm only running one instance of TPOT, so it's likely the sole consumer of the system resources that are depleted after it's done. I notice you didn't directly reply yet re: the confusing variable names for the Dask client.

**My data**

My data is fairly small, though I'm not clear on the different ways it might be processed and transformed internally by TPOT:

```python
>>> df.shape
(312735, 240)
>>> size_bytes = df.memory_usage(deep=True).sum()
>>> size_mb = size_bytes / (1024 * 1024)
>>> size_mb
76.05307102203369
```

**Environment**

I was able to capture the memory and CPU traces from Docker after the fact, still corresponding to the run from my previous attachment, and I'm including those here. I've seen explicit errors from crashed futures in previous runs with less memory, but those don't appear in my attachment, only warnings re: restarting workers.
I agree about the variable name... though I don't think keeping track of the client variable is what's causing this particular issue. Just to clarify, is TPOT completely crashing for you, or does it still return an output? Can you check `est.evaluated_individuals`? One issue I've noticed is that at some point the Dask client crashes and can't resume, and all subsequent pipelines fail. Since TPOT doesn't know the Dask client itself has failed, it simply marks all following pipelines as failed, thinking the error is the pipeline's fault. Are you noticing that in your runs?

It's possible that (at least some) scikit-learn methods must convert the booleans to floats (at least for intermediate results), so the true size of the dataset could be 572 MB if all variables need to be represented as floats (e.g., MLP would probably have everything turn into a float). It's possible that a smaller dataset size or more RAM (if possible) would resolve this.

```python
import pandas as pd
```
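The 572 MB estimate above is straightforward arithmetic on the shape reported earlier in the thread, assuming every value gets widened to an 8-byte float64:

```python
rows, cols = 312_735, 240            # df.shape reported above
bytes_as_float64 = rows * cols * 8   # 8 bytes per value if everything becomes float64
mb = bytes_as_float64 / (1024 * 1024)
print(f"{mb:.1f} MB")                # -> 572.6 MB
```

Compare that with the ~76 MB reported by `df.memory_usage(deep=True)`: a mostly-boolean frame occupies roughly one byte per value, so an internal cast to float64 inflates it nearly 8x.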
Thanks again for the quick reply! Sometimes TPOT is crashing, but in my original report and several similar runs, it completed; then a subsequent attempt to run basic scikit-learn code for cross-validation crashed due to lack of resources. Still, at a minimum of 572 MB per worker × 10 workers, 12 GB would presumably be enough. I'm trying to run again with 5 cores instead of 10, which should roughly halve the memory use. I've seen log messages from TPOT in earlier runs that seemed to indicate what you describe, with many pipelines failing, though none of those log messages appeared here. I'll try to reply again with feedback as I'm able, but unfortunately time is limited on this project and runs take a long time. I probably can't dedicate much time to debugging this, and may need to change frameworks if that's the fastest approach. At the least, I'll try to provide whatever feedback I can.
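The worker math above can be checked the same way, under the assumption that each worker holds one full float64 copy of the data (numbers from the thread):

```python
# ~572.6 MB per worker if every value is an 8-byte float64
per_worker_mb = 312_735 * 240 * 8 / (1024 * 1024)

for workers in (10, 5):
    total_gb = per_worker_mb * workers / 1024
    print(f"{workers} workers -> {total_gb:.2f} GB")
# -> 10 workers -> 5.59 GB
# ->  5 workers -> 2.80 GB
```

Both totals fit comfortably in 12 GB, which is consistent with the theory that transient blow-ups inside a single transform, rather than the steady-state dataset size, are what exhaust memory.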
Ok, looks like reducing the number of cores per your suggestion worked, thank you! I was able to optimize hyperparameters for 3 different models in sequence in a single script without crashing, though there were some scattershot timed-out / killed worker processes. Also, in some cases there were many generations without improvement, so I'll have to look into whether those are real or perhaps a less-noticeable side effect of memory-related errors. This all seems to support your hypothesis about what's happening. A random assortment of related error messages from one portion of the run:
I've been using `TPOTEstimator` to evolve hyperparameters for my ML models, and it appears there's probably a memory leak in the TPOT code. My WIP code & data are closed-source, so I won't be able to definitively prove that that's the case, but I'm filing a report since this might still be useful information.

**Related + IMO confusing TPOT code**

In a quick skim through the TPOT code, I noticed this code block, with two near-identically named data members that could easily be confused with each other and cause a memory leak. Regardless of whether my example is actually a TPOT memory leak, I'd suggest using one `client` variable and a separate `bool` to track whether or not it was user-provided. These easily-confused variables seem likely to be a source of future bugs, if not the cause of a current one.

**My (imperfect) example**
I'm using 10 cores and my training data has shape (312_735, 240). Using TPOT2 0.1.9a0.
I'm attaching some runtime output, 2025-01-29 Likely memory leak.txt, which is a little noisy, but my script doesn't do much other than use TPOT before failing (in this case). Many other tests with no code edits but different data & TPOT versions have worked. Some relevant log messages match up with my recipe below: `fit()` has finished.

**My Code**
My WIP code basically does this:

- Some worker restart messages don't build confidence, but this step completes.
- I know TPOT is already doing this; this just accounts for a different, optional code path where TPOT isn't involved, and in my case isn't too expensive to run again.
I'll look into providing my own Dask client to TPOT to work around this, but I think the potential for a TPOT memory leak is still worth reporting.
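The one-`client`-variable-plus-`bool` suggestion above could be sketched like this; the class and the stand-in client are illustrative, not TPOT's actual code:

```python
class DummyClient:
    """Stand-in for a dask.distributed Client, just to illustrate the pattern."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class Estimator:
    """Sketch of the suggested refactor: one `client` attribute plus a
    bool recording ownership, instead of two near-identically named
    client data members."""
    def __init__(self, client=None):
        self._owns_client = client is None  # did we create it, or did the user?
        self._client = DummyClient() if client is None else client

    def shutdown(self):
        if self._owns_client:  # never tear down a client the user handed us
            self._client.close()

# A user-provided client survives shutdown; an internally created one does not.
user_client = DummyClient()
Estimator(user_client).shutdown()
print(user_client.closed)   # False

est = Estimator()
est.shutdown()
print(est._client.closed)   # True
```

Collapsing the two variables into one plus an ownership flag makes it impossible for cleanup code to close (or leak) the wrong client reference.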
The TPOTEstimator portion of the code is: