Training consistently stalling #2089
Comments
Hi @gomesteixeira,

I believe the issue here is that the centered instance model is expecting a
We fixed this in #2054, but this didn't make it into the newest release (v1.4.1). While you could install from source to get the latest changes, I think you could just fix your issue by setting the input scaling to 1.0 in your centered instance model. You're already cropping to a decently small size, so I don't think you want to be scaling it down any further.

Let us know if that works!

Cheers,

Talmo
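For anyone hitting the same thing and editing configs by hand, here is a minimal sketch of where that setting lives. The key path data -> preprocessing -> input_scaling and the profile filename are assumptions about the SLEAP 1.x training profile layout, not something confirmed in this thread, so verify them against your own exported profile:

    # Rough sketch (assumed key path): make the centered instance model train
    # on full-resolution crops by setting input scaling to 1.0 in a saved profile.
    import json

    profile_path = "centered_instance.json"  # hypothetical profile filename

    with open(profile_path) as f:
        cfg = json.load(f)

    # Assumed location of the input scaling field in SLEAP 1.x profiles.
    cfg["data"]["preprocessing"]["input_scaling"] = 1.0

    with open(profile_path, "w") as f:
        json.dump(cfg, f, indent=2)

In the GUI, the equivalent is the Input Scaling field in the centered instance model's training settings.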
Hi talmo,

Thanks for your quick response! That makes sense. To be honest, I feel a little silly now, because that means I did change something else about the hyperparameters compared to what I was doing before, when it was working. I'm happy to hear this will be fixed in a future release. I would also suggest explaining more clearly in the docs/tutorial how these crops work and interact with one another (although maybe I just misunderstood/misinterpreted it).

In any case, my centered instance model was still not working (the centroid model still trains just fine), although now it was throwing an error, so I guess that is an improvement! The error dialog box points to the console log, which I'm including below.

Console Log
From reading the log, I felt this error was caused by a lack of memory on my PC (which I don't think I experienced before), so I halved the batch size for the centered instance model training and it is working now.

I should mention that when this error was thrown I used the centroid model I had just trained, instead of training a new one. Do you think it's a better idea to train each one at a time? I don't see why training the centroid model just beforehand would contribute to the issue, but to be honest I didn't test that. Let me know if it would be useful to test it!

In any case, a lower batch size might lead to overfitting, which, given the goals of my implementation, I'm very adamant about preventing, so I would love to hear your insights on other ways I could address this (apart from buying more memory, which I will probably have to do in the future anyway).

Either way, I think you can close this issue. Thank you again for helping!
Hi @gomesteixeira,

Ok, great!
Yeah, I think with the fix it won't really matter anymore, but maybe we should note this somewhere anyhow. Regarding the new issue: yes, it's related to running out of memory on your GPU. Here's the culprit:
In top-down models, SLEAP will automatically calculate the size of the bounding box to crop around the animals based on the sizes present in your labels. It seems like there is at least one instance that is quite large (1000x1000 px or larger). I can't tell what the original size of your full frames is from the logs, but I'm guessing that that size is way too big no matter what. This often happens when you have a stray annotation where maybe you forgot to mark a couple of nodes as "not visible", so the bounding box is huge.

You should definitely look through your labels to make sure one of them isn't messed up, but either way, you can circumvent this problem by just setting the bounding box crop size manually. From the original logs, it looks like they were

Let us know if that works for you :)

Cheers,

Talmo
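A rough way to act on the "look through your labels" advice is to scan the labels file for instances with oversized bounding boxes. This sketch assumes the SLEAP 1.x Python API (sleap.load_file, Instance.numpy()); the filename and the 512 px threshold are placeholders, not values from this thread:

    # Rough sketch: flag instances whose bounding box is suspiciously large,
    # e.g. because a stray node was left placed instead of marked "not visible".
    import numpy as np
    import sleap

    labels = sleap.load_file("labels.v001.slp")  # placeholder filename
    MAX_SIDE = 512  # placeholder threshold (px); pick what is plausible for one rat

    for lf in labels.labeled_frames:
        for inst in lf.instances:
            pts = inst.numpy()  # (n_nodes, 2) array, NaN for invisible nodes
            if np.all(np.isnan(pts)):
                continue
            width = np.nanmax(pts[:, 0]) - np.nanmin(pts[:, 0])
            height = np.nanmax(pts[:, 1]) - np.nanmin(pts[:, 1])
            if max(width, height) > MAX_SIDE:
                print(f"{lf.video.filename} frame {lf.frame_idx}: "
                      f"bbox ~{width:.0f} x {height:.0f} px")

Any frame it prints is worth opening in the GUI to check for a misplaced node; after cleaning up (or instead of it), the crop size can be set manually in the training configuration rather than left on auto, as Talmo suggests.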
Bug description
Hi everyone!
I’m encountering an issue where model training stalls consistently at the same point. I posted about this in a thread where a similar (if not the same) problem was discussed; however, I’m now raising an issue because I have tried more troubleshooting steps and can add more details.
Expected behaviour
My goal is to track two (for now) rats in a large environment, recorded from above, so I am using the multi-animal top-down model. Initially I had defined a quite complex skeleton with 11 keypoints/body parts. I made sure to use an adequate anchor point, defined a sufficiently large crop size, and used -180° to 180° for the rotation augmentation (on both the centroid and centered instance models). I didn’t really touch any of the other hyperparameters.
That was working OK-ish, meaning that I could train the model on some 80 labeled frames and it would distinguish the two rats, but it would not properly identify the body parts. Because of that, I made a new project in which I removed some body parts that were not visible most of the time, and this time I also set the maximum number of instances (to 2). I didn't change anything else from what I was doing before, and this is when the issue showed up.
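For context, here is a sketch of where the settings described above sit in an exported SLEAP training profile; the key paths, the "thorax" anchor node, and the 256 px crop size are assumptions/placeholders based on my reading of the 1.x profile layout, not values taken from this report:

    # Rough sketch (assumed key paths): anchor part, crop size, and rotation
    # augmentation in a centered-instance training profile JSON.
    import json

    with open("centered_instance.json") as f:  # hypothetical profile filename
        cfg = json.load(f)

    cfg["data"]["instance_cropping"]["center_on_part"] = "thorax"  # hypothetical anchor node
    cfg["data"]["instance_cropping"]["crop_size"] = 256            # placeholder crop size (px)
    cfg["optimization"]["augmentation_config"]["rotate"] = True
    cfg["optimization"]["augmentation_config"]["rotation_min_angle"] = -180
    cfg["optimization"]["augmentation_config"]["rotation_max_angle"] = 180

    with open("centered_instance.json", "w") as f:
        json.dump(cfg, f, indent=2)

(The maximum number of instances is, as far as I understand, an inference-time setting rather than part of the training profile.)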
Actual behaviour
The training for the centroid model apparently works fine, but the training for the centered instance model always stalls after 198 or 199 batches in training epoch 1. By stalling I mean that nothing happens: it just keeps running without advancing. When I press the ‘Stop Early’ button, nothing happens either, but ‘Cancel Training’ does work. I have left it running for two days (whereas before it would be done in a couple of hours) to make sure it never advances.
By inspecting the conda console, I see ‘Finished training centroid’, then it starts the centered instance training, and throws the following:
On different troubleshooting runs (I'll describe my troubleshooting attempts below) I have seen different numbers for the shape; however, this specific error always comes up. I'm attaching the console logs that pertain only to the centered instance model training, but please let me know if it would be useful to attach the whole console output.
Your personal set up
OS: Windows-10-10.0.22621-SP0
Version(s): SLEAP v1.3.3, python 3.7.12
SLEAP installation method (listed here):
Environment packages
Logs
Troubleshooting attempts / other thoughts
I have tried labelling more data, and also reducing the plateau patience and the batch size for the centered instance training (in this case, it stalls at 195 instead of 199 batches). I have also tried both the baseline and different previously trained models (which show up in the dropdown list).
Because of this, I concluded the issue is probably with my machine, so I tried trivial things like rebooting, freeing up disk space, and defragmenting the disk (which probably have nothing to do with it anyway, but are always worth a try). I also updated my Nvidia drivers.
Since I am using the same data, the same machine, and the same SLEAP version, I went back to my initial project (which, as I mentioned, was working, in the sense that training would conclude and I would get predictions from the model) and ran the training again, and now I encounter the same issue there as well. Hence, I am very confident the problem somehow results from my machine and not from model specifics, but I can’t understand what it could be. Between running the initial project and this new one where the issue came up, there was the Christmas break, so now I’m wondering whether some sort of Windows update could have messed it up? I’ve had that experience in the past with other software.
I’ll be happy to hear your suggestions/insights about how I can resolve this issue.
Screenshots
How to reproduce
I just go to Predict > Run Training and click 'Run' after setting the desired hyperparameters.
I see the graph for the centroid training, showing its progression. I also see the graph for the centered instance model, but it only shows light-blue points (as opposed to the previous graph). I don't think I can provide more details about how to reproduce this issue, unfortunately.