
Potential memory leak during inference? #2081

Open
olinesn opened this issue Jan 6, 2025 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@olinesn

olinesn commented Jan 6, 2025

Bug description

When running inference, the GPU starts using system memory (as seen in the task manager) until there's none left, and then inference crashes. You can see that the GPU's "Shared GPU memory usage" climbs, but the on-GPU memory is hardly used at all.

Expected behaviour

Inference completes without issue, the way that it does for shorter videos.

Actual behaviour

[Screenshot: Task Manager showing shared GPU memory usage climbing while dedicated GPU memory remains low]

Your personal set up

Threadripper PRO 24 core
128 GB memory
NVIDIA RTX 6000

Environment packages
# paste output of `pip freeze` or `conda list` here
Logs
# paste relevant logs here, if any
[Screenshot: console logs from the inference run]

Screenshots

[Screenshot: Task Manager memory usage]
@olinesn olinesn added the bug Something isn't working label Jan 6, 2025
@eberrigan
Contributor

Oh wow, that is a lot of GPU memory.

With your sleap environment activated, can you let us know the output of the command `nvidia-smi`? I would like to see your GPU setup.

Thanks!

Elizabeth

@olinesn
Author

olinesn commented Jan 6, 2025

> Oh wow, that is a lot of GPU memory.
>
> With your sleap environment activated, can you let us know the output of the command `nvidia-smi`? I would like to see your GPU setup.
>
> Thanks!
>
> Elizabeth

[Screenshot: nvidia-smi output]

Sure, here you go. Just to clarify, it doesn't look like the 48 GB of the RTX 6000's GPU memory is being used. Rather, it looks like this is getting swapped to system memory.

@eberrigan
Contributor

With the first screenshot you sent, it looks like it is trying to use the CPU. With the second screenshot, I can partially see that it is using GPU 0, which is the one we want to use.

Do you mind just copying and pasting the entire command and output from the terminal instead of the screenshots?

Thanks!

@olinesn
Author

olinesn commented Jan 8, 2025

Hi @eberrigan , thanks for helping me work on this puzzle.

Here's a zipped file with the models (centroid and centered instance), a demo video to run inference on, and my logs when I run sleap-track.

You can see that even for a short video, when I run sleap-track, the GPU's dedicated memory is relatively unused, but the GPU swap memory keeps increasing until the rig is out of RAM.

At that point, in the logs, you can see that something changes (it stalls at 57% completed, when all 128 GB of RAM get saturated), and then something gets adjusted and it completes. It also gets stuck at 100% with the green bar for several minutes, with all 24 of my cores running at 50%; I'm not sure what that means.

Thanks!

Zipped file: https://drive.google.com/file/d/1NpfDJHKSh9Sv_ycMrNLf5lpOCn9giwQI/view?usp=sharing

[Screenshot: Task Manager memory usage during sleap-track]

@olinesn
Author

olinesn commented Jan 8, 2025

[Screenshot]

@eberrigan
Contributor

Hey @olinesn, is this a single-animal experiment? I noticed you have --tracking.clean_instance_count set to one.

If that is the case, let's just go ahead and do inference without tracking. You can set -n MAX_INSTANCES, --max_instances MAX_INSTANCES (which "Limit[s the] maximum number of instances in multi-instance models. Not available for ID models. Defaults to None.") to 1 in order to perform inference with the top-down model with the number of animals = 1. Since you do not have more than one animal, there is no need to do tracking, and we can improve the model predictions with that constraint.

https://sleap.ai/develop/guides/cli.html
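A command along these lines should do it (the video path and model folder names below are placeholders; substitute your own trained centroid and centered-instance model directories):

```shell
# Top-down inference on a single-animal video, capping detections
# at one instance. With no --tracking.tracker specified, no track
# assignment is performed.
sleap-track video.mp4 \
    -m models/centroid_model \
    -m models/centered_instance_model \
    --max_instances 1 \
    -o video.predictions.slp
```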

@eberrigan eberrigan self-assigned this Jan 13, 2025
@olinesn
Author

olinesn commented Jan 13, 2025

Hi @eberrigan ,

OK, this seems to be running without complaint; it's strange that tracking would be causing this problem. Thanks for the suggestion!

A fraction of this dataset is two animals, so I'm going to need to do tracking eventually. Is there a better way that I can set up the flow tracking? Does this behavior indicate a memory leak issue in sleap?

Thanks,
Stefan

@eberrigan
Contributor

You should be able to do tracking but there isn't a reason to when there is only one animal.

I believe the issue was with inference not knowing how many animals are in the new data, so the shape of the tensor changes, causing retracing (tensorflow/tensorflow#34025). This might not be an issue when using the bottom-up method, which doesn't rely on the centroid model.

If you have some data with different numbers of animals you might get better results running inference separately and specifying the number of animals per dataset using -n. Am I understanding that correctly, or are you saying you have data where the animals go off-camera and then come back?

@olinesn
Author

olinesn commented Jan 13, 2025

Thanks, OK, that's helpful to understand. Sometimes I place one mouse in the box, and sometimes I place two.

Could you advise on some sample syntax for the sleap-track command when there are two animals in the box? The logic of the tensor changing size makes sense, but I want to make sure I nail the syntax the way you're recommending.

@eberrigan
Contributor

I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.

It will also improve tracking a lot, since if everything is one video and tracking is run, a new track will be made when an animal reappears. So I expect that if you are swapping animals or removing and replacing animals, you may end up with a lot of tracks at the end of the video. Do you have some sort of pipeline for dealing with that?
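Concretely, the per-video split might look like this (filenames and model folder names are placeholders; the flow tracker is one of the tracker options in the CLI docs linked above):

```shell
# Single-animal videos: cap detections at one instance, no tracking.
sleap-track single_mouse_session.mp4 \
    -m models/centroid_model -m models/centered_instance_model \
    --max_instances 1 \
    -o single_mouse_session.predictions.slp

# Two-animal videos: cap at two instances and run flow tracking
# to link instances across frames.
sleap-track paired_mouse_session.mp4 \
    -m models/centroid_model -m models/centered_instance_model \
    --max_instances 2 \
    --tracking.tracker flow \
    -o paired_mouse_session.predictions.slp
```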

@olinesn
Author

olinesn commented Jan 21, 2025

Sorry I got slammed at the end of last week. Thanks for your thoughts.

> I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.

Yes, it's perfectly doable for me to predetermine the number of animals for the majority of these experiments. What would a reasonable sleap-track command look like? I just want to make sure I'm interpreting this correctly:

> You can set the -n MAX_INSTANCES, --max_instances MAX_INSTANCES which "Limit maximum number of instances in multi-instance models. Not available for ID models. Defaults to None."

Are "-n" and "--max_instances" synonymous, or do you have to use both to get this effect? I've never used "-n".

> It will also improve tracking a lot since if everything is one video and tracking is run, when an animal reappears, a new track will be made, so I expect that if you are swapping animals or removing and replacing animals, you may end up with a lot of tracks at the end of the video. Do you have some sort of pipeline for dealing with that?

We actually have to think about this a lot because of reflections. Sometimes if there are 3 animals plus reflections, one of the reflections can get picked up as an instance, and occasionally has a higher score than the 3 real animals. If we set max instances to 3, then we drop the instance of a real animal on that frame, so usually we're setting it to n+1 or n+2.

@roomrys
Collaborator

roomrys commented Jan 22, 2025

Hi @olinesn,

Yes, -n and --max_instances are synonymous; the latter is just more verbose and possibly more readable for others.

Thanks,
Liezl

@olinesn
Author

olinesn commented Feb 7, 2025

@eberrigan Unfortunately this doesn't seem to solve the problem. Are you able to try giving it a go? If you can download the zipped file and try running sleap-track, I'm curious to see if it runs for you or crashes.
