Crashing with STATUS_INTEGER_DIVIDE_BY_ZERO error #50
When first starting to try Brush, it crashed on odd occasions. Now I am starting to try my own mods, and with the download of the latest it's getting much worse...
On Windows 10, using Rust 1.82.0: training the hotdog demo zip will often get to only a few hundred steps, and then crash with the above error in the console window after doing a "cargo run". Completely random, non-repeatable, and changing parameters seems to not matter.
This problem comes and goes... I think I have a workaround, and then it comes back. I'm going to have to get my hands dirty under the hood, inside the code, to try to track down the culprit. I have zero knowledge of Rust, and it's annoying having to get to grips with the language just to get some answers.
Hey! Yes, so sorry about this, it's a really annoying error as there's very little to go on :/ More people have reported it though, so it is definitely there. The good news is that I have been trying to fix some memory corruption on Metal which I think MIGHT be related to this. See this, this and, believe it or not, this PR. I'm still waiting on a Cube update, at which point there is a chance this will be fixed. If that doesn't do it, it would be amazing if you could dive in of course :) Worst case is doing some fun Rust! PS: Are you by any chance running on an integrated GPU?
Thanks for getting back on this! Good to hear that 'salvation' might be in sight... The GPU is an old 4 GB card, sitting in an equally old HP workstation, with not enough compute power to use the Python stuff. Which is why the Brush solution will be excellent!! Cheers!
Okay, the good news is that the latest round of changes, so far, seems to have solved the crashing from the divide by zero error. The not so good news is that the dataset viewer's interface, and its linkage with the scene viewer, is now badly broken; something you may already be aware of. But I'm OK with that for now, as it's far better that the training continues processing!
Not so good news again... if I push the processing a bit, by changing some hard-coded parameters so that more splats are produced per a given number of iters, then instability returns: divide by zero and other crashes and panics occur. Is it because memory limits are being pushed? Not really sure, I don't see a pattern yet; there is no clear OOM failure barrier, as one gets, quite dependably, with the Python and C Gaussian splatting methods I've tried.
Shoot, that is really bad news :/ Had really hoped some of the memory corruption fixes would do it! I have just pushed another update which moves Burn past a version with another memory corruption. In theory that one didn't apply on Vulkan (the default on Windows), but let me know nevertheless. As for the dataset + scene views, quite broken indeed, sorry about that! Been busy moving house but hopefully back to full speed soon to fix that.
Have just updated to the latest Burn version, and things seem to be better; fingers crossed! Trying to get a handle on when it crashes due to OOM, and this appears to be down to GPU limits: I note that GPU memory usage goes up and up, and then suddenly drops dramatically; do we know what the conditions are for this? If it can be controlled, or anticipated, then more ambitious runs can be tried... The good news is that the actual splatting does better on a test case of mine (handrails on the end of a building) than any other GS technique I've tried, thumbs up! Hope the move went well; cheers, Frank
Oh hah, good to hear, maybe it does help :) The memory usage can be a bit weird indeed. The allocator I implemented for Burn allocates pages, and frees a page in a pool when it hasn't been used in N allocations, with N depending on the size of the pool. When the # of splats varies wildly, many pools allocate pages and memory use goes up, until those pages become unused and a bunch get removed again. This is definitely not a great strategy: when splat numbers stay relatively stable it should be OK, but it doesn't sound like it works out well for you!
To be clear, the OOM crash is a separate thing you're seeing vs the integer divide? So far, when I've run out of memory my GPU just starts swapping memory in and out. Terribly slow, but not crashing.
In better news, I've fixed the scene <-> dataset view link! Also wonderful to hear quality is better :D It's generally still a bit behind, but getting there :)
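To make the pooling behaviour described above a bit more concrete, here is a minimal sketch of that kind of policy: pages are reused from a pool and only reclaimed once they have gone unused for N allocations, with N depending on the pool's page size. All names and thresholds below are illustrative assumptions, not the actual Burn/CubeCL allocator code.

```rust
// Illustrative sketch only, not the real allocator.
struct Page {
    in_use: bool,
    // Value of the pool's allocation counter the last time this page was handed out.
    last_used_at: u64,
}

struct Pool {
    page_size: usize,
    pages: Vec<Page>,
    // Counts every allocation request served by this pool.
    alloc_counter: u64,
}

impl Pool {
    fn new(page_size: usize) -> Self {
        Self { page_size, pages: Vec::new(), alloc_counter: 0 }
    }

    /// The "N allocations" window; made-up numbers, larger pools hold pages longer.
    fn retention_window(&self) -> u64 {
        32 + (self.page_size as u64 / (1 << 20))
    }

    fn allocate(&mut self) -> usize {
        self.alloc_counter += 1;
        // Reuse a free page if one exists, otherwise grow the pool by one page.
        let idx = match self.pages.iter().position(|p| !p.in_use) {
            Some(i) => i,
            None => {
                self.pages.push(Page { in_use: false, last_used_at: 0 });
                self.pages.len() - 1
            }
        };
        self.pages[idx].in_use = true;
        self.pages[idx].last_used_at = self.alloc_counter;
        idx
    }

    fn free(&mut self, idx: usize) {
        self.pages[idx].in_use = false;
    }

    /// Drop pages that have sat unused for more than the retention window.
    fn reclaim(&mut self) {
        let window = self.retention_window();
        let now = self.alloc_counter;
        self.pages.retain(|p| p.in_use || now - p.last_used_at <= window);
    }
}

fn main() {
    let mut pool = Pool::new(4 << 20); // 4 MiB pages
    let a = pool.allocate();
    let b = pool.allocate();
    pool.free(a);
    pool.free(b);
    // Later allocations keep refreshing one page while the other goes stale
    // and eventually gets reclaimed.
    for _ in 0..100 {
        let p = pool.allocate();
        pool.free(p);
        pool.reclaim();
    }
    println!("pages retained after reclaim: {}", pool.pages.len());
}
```

Under a policy like this, a stable splat count keeps refreshing the same pages and nothing is reclaimed, while a wildly varying count leaves many pages idle at once and they get dropped in a batch, which would show up as the sudden drop in GPU memory usage mentioned above.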
Okay, I tried pushing the memory thing a bit just now, and this is what I got; pretty typical, and contrasting with the integer divide by zero error:
Caused by: note: run with
Caused by: error: process didn't exit successfully:
The iter messages were from a println I inserted to keep track. I don't mind if there is mad memory swapping, so long as it keeps running! Good to hear progress is happening; look forward to getting updates!
Bad news... the integer divide by zero crash is back with a vengeance, and seemingly nothing to do with memory; it happens at step 190, with no debug info, under certain conditions! Will explore a bit more to try to narrow down precisely why the program gets unhappy... Latest source; also updated to Rust 1.83, from 1.82, just in case. No change found in behaviour.
It's getting interesting... looks like it's triggered by a 'just right' condition: images are 5616 pixels wide, and if the training resolution is set to 1404, which "nicely fits" (5616 / 1404 = 4 exactly), I get the crash. If it's set to 1403, it continues running... Update!! Spoke too soon! The 1403 variant runs longer, but on a second attempt it crashed some time later...
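For concreteness, the "nicely fits" observation is just divisibility: 1404 divides the 5616-pixel width exactly, while 1403 leaves a remainder. The snippet below only illustrates that arithmetic; it is not Brush code and says nothing about where the actual division by zero occurs.

```rust
fn main() {
    let image_width = 5616u32;
    for target in [1404u32, 1403] {
        // 1404 gives remainder 0 ("nicely fits"); 1403 gives remainder 4.
        println!(
            "width {} / resolution {} = {}, remainder {}",
            image_width,
            target,
            image_width / target,
            image_width % target
        );
    }
}
```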
Okay, it's not magic numbers with regard to the image size. Latest testing indicates that the program has to "settle in"; meaning, do whatever it takes with the settings to get Brush to happily continue processing some image set, say into the 1,000s of steps, then pause, adjust parameters and reload the dataset per what is actually desired. Note: don't exit and restart the program, it will only crash again.
If you don't mind, closing this in favour of #60!