
Potential memory leak during inference? #2081

Open
olinesn opened this issue Jan 6, 2025 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@olinesn

olinesn commented Jan 6, 2025

Bug description

When running inference, the GPU starts using system memory (as seen in the task manager) until there's none left, and then inference crashes. You can see that the GPU's "Shared GPU memory usage" climbs, but the on-GPU memory is hardly used at all.

Expected behaviour

Inference completes without issue, the way that it does for shorter videos.

Actual behaviour

[Screenshot: Task Manager showing shared GPU memory usage climbing while dedicated GPU memory remains low]

Your personal set up

Threadripper PRO 24 core
128 GB memory
NVIDIA RTX 6000

Environment packages
# paste output of `pip freeze` or `conda list` here
Logs
# paste relevant logs here, if any
[Screenshot: console logs from the inference run]

Screenshots

[Screenshot: Task Manager memory usage]
@olinesn olinesn added the bug Something isn't working label Jan 6, 2025
@eberrigan
Contributor

Oh wow, that is a lot of GPU memory.

With your sleap environment activated, can you let us know the output of the command `nvidia-smi`? I would like to see your GPU setup.

Thanks!

Elizabeth

@olinesn
Author

olinesn commented Jan 6, 2025

> Oh wow, that is a lot of GPU memory.
>
> With your sleap environment activated, can you let us know the output of the command `nvidia-smi`? I would like to see your GPU setup.
>
> Thanks!
>
> Elizabeth

[Screenshot: nvidia-smi output]

Sure, here you go. Just to clarify, it doesn't look like the 48 GB of the RTX 6000's GPU memory is being used. Rather, it looks like this is getting swapped to system memory.

@eberrigan
Contributor

With the first screenshot you sent, it looks like it is trying to use the CPU. With the second screenshot, I can partially see that it is using GPU 0, which is the one we want to use.

Do you mind just copying and pasting the entire command and output from the terminal instead of the screenshots?

Thanks!

@olinesn
Author

olinesn commented Jan 8, 2025

Hi @eberrigan , thanks for helping me work on this puzzle.

Here's a zipped file with the models (centroid and centered instance), a demo video to run inference on, and my logs when I run sleap-track.

You can see that even for a short video, when I run sleap-track, the GPU's dedicated memory is relatively unused, but the GPU swap memory keeps increasing until the rig is out of RAM.

At that point, in the logs, you can see that something changes (it stalls at 57% completed, when all 128 GB of RAM get saturated), and then something gets adjusted and it completes. It also gets stuck at 100% with the green bar for several minutes, with all 24 of my cores running at 50%; I'm not sure what that means.

Thanks!

Zipped file: https://drive.google.com/file/d/1NpfDJHKSh9Sv_ycMrNLf5lpOCn9giwQI/view?usp=sharing

[Screenshot: Task Manager memory usage during sleap-track]

@olinesn
Author

olinesn commented Jan 8, 2025

[Screenshot]

@eberrigan
Contributor

Hey @olinesn, is this a single-animal experiment? I noticed you have --tracking.clean_instance_count set to one.

If that is the case, let's just go ahead and do inference without tracking. You can set -n MAX_INSTANCES, --max_instances MAX_INSTANCES (which "Limit[s the] maximum number of instances in multi-instance models. Not available for ID models. Defaults to None.") to 1 in order to perform inference with the top-down model with the number of animals = 1. Since you do not have more than one animal, there is no need to do tracking, and we can improve the model predictions with that constraint.

https://sleap.ai/develop/guides/cli.html
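A command along these lines should do it (the video path and model folder names below are placeholders; substitute your own trained centroid and centered-instance model directories):

```shell
# Top-down inference on a single-animal video, capping detections
# at one instance. With no --tracking.tracker specified, no track
# assignment is performed.
sleap-track video.mp4 \
    -m models/centroid_model \
    -m models/centered_instance_model \
    --max_instances 1 \
    -o video.predictions.slp
```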

@eberrigan eberrigan self-assigned this Jan 13, 2025
@olinesn
Author

olinesn commented Jan 13, 2025

Hi @eberrigan ,

OK, this seems to be running without complaint; it's strange that tracking would be causing this problem. Thanks for the suggestion!

A fraction of this dataset is two animals, so I'm going to need to do tracking eventually. Is there a better way that I can set up the flow tracking? Does this behavior indicate a memory leak issue in sleap?

Thanks,
Stefan

@eberrigan
Contributor

You should be able to do tracking but there isn't a reason to when there is only one animal.

I believe the issue was with inference not knowing how many animals are in the new data, so the shape of the tensor changes, causing retracing (tensorflow/tensorflow#34025). This might not be an issue when using the bottom-up method, which doesn't rely on the centroid model.

If you have some data with different numbers of animals you might get better results running inference separately and specifying the number of animals per dataset using -n. Am I understanding that correctly, or are you saying you have data where the animals go off-camera and then come back?

@olinesn
Author

olinesn commented Jan 13, 2025

Thanks, OK, that's helpful to understand. Sometimes I place one mouse in the box, and sometimes I place two.

Could you advise on some sample syntax for the sleap-track command when there are two animals in the box? The logic of the tensor changing size makes sense, but I want to make sure I nail the syntax the way you're recommending.

@eberrigan
Contributor

I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.

It will also improve tracking a lot, since if everything is one video and tracking is run, a new track will be made when an animal reappears. So I expect that if you are swapping animals or removing and replacing animals, you may end up with a lot of tracks at the end of the video. Do you have some sort of pipeline for dealing with that?
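Concretely, the per-video split might look like this (filenames and model folder names are placeholders; the flow tracker is one of the tracker options in the CLI docs linked above):

```shell
# Single-animal videos: cap detections at one instance, no tracking.
sleap-track single_mouse_session.mp4 \
    -m models/centroid_model -m models/centered_instance_model \
    --max_instances 1 \
    -o single_mouse_session.predictions.slp

# Two-animal videos: cap at two instances and run flow tracking
# to link instances across frames.
sleap-track paired_mouse_session.mp4 \
    -m models/centroid_model -m models/centered_instance_model \
    --max_instances 2 \
    --tracking.tracker flow \
    -o paired_mouse_session.predictions.slp
```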

@olinesn
Author

olinesn commented Jan 21, 2025

Sorry I got slammed at the end of last week. Thanks for your thoughts.

> I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.

Yes, it's perfectly doable for me to predetermine the number of animals for the majority of these experiments. What would a reasonable sleap-track command look like? I just want to make sure I'm interpreting this correctly:

> You can set the -n MAX_INSTANCES, --max_instances MAX_INSTANCES which "Limit maximum number of instances in multi-instance models. Not available for ID models. Defaults to None."

Are "-n" and "--max_instances" synonymous, or do you have to use both to get this effect? I've never used "-n".

> It will also improve tracking a lot since if everything is one video and tracking is run, when an animal reappears, a new track will be made, so I expect that if you are swapping animals or removing and replacing animals, you may end up with a lot of tracks at the end of the video. Do you have some sort of pipeline for dealing with that?

We actually have to think about this a lot because of reflections. Sometimes if there are 3 animals plus reflections, one of the reflections can get picked up as an instance, and occasionally has a higher score than the 3 real animals. If we set max instances to 3, then we drop the instance of a real animal on that frame, so usually we're setting it to n+1 or n+2.

@roomrys
Collaborator

roomrys commented Jan 22, 2025

Hi @olinesn,

Yes, -n and --max_instances are synonymous; the latter is just more verbose and possibly more readable for others.

Thanks,
Liezl

@olinesn
Author

olinesn commented Feb 7, 2025

@eberrigan Unfortunately this doesn't seem to solve the problem. Are you able to try giving it a go? If you can download the zipped file and try running sleap-track, I'm curious to see if it runs for you or crashes.
