Crashing with STATUS_INTEGER_DIVIDE_BY_ZERO error #50

Closed
fasteinke opened this issue Nov 24, 2024 · 13 comments

Comments

@fasteinke

When I first started trying brush, it crashed on odd occasions. Now that I'm starting to try my own mods, and with the latest download, it's getting much worse ...

On Windows 10, using Rust 1.82.0: training the hotdog demo zip often gets only a few hundred steps in, then crashes with the above error in the console window after a "cargo run". It's completely random and non-repeatable, and changing parameters doesn't seem to matter.
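
For context, here's a minimal stand-alone sketch (illustrative only, not part of brush; the binary path is taken from the cargo output later in this thread) of how the two failure modes look from the outside. STATUS_INTEGER_DIVIDE_BY_ZERO is the NTSTATUS value 0xC0000094 raised by a native integer division by zero, whereas a Rust panic ends the process with exit code 101 - and safe Rust integer division by zero always panics, so a raw 0xC0000094 usually points at native, driver, or GPU-side code rather than a checked Rust division.

// Sketch only: launch the brush debug binary and report how it terminated.
use std::process::Command;

fn main() {
    let status = Command::new(r"target\debug\brush_bin.exe")
        .status()
        .expect("failed to launch brush");

    match status.code() {
        // NTSTATUS 0xC0000094 = STATUS_INTEGER_DIVIDE_BY_ZERO.
        Some(code) if code as u32 == 0xC000_0094 => {
            println!("terminated by a native integer divide-by-zero");
        }
        // Rust's default panic handler aborts the process with exit code 101.
        Some(101) => println!("terminated by a Rust panic"),
        other => println!("exited with {:?}", other),
    }
}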

@fasteinke
Author

fasteinke commented Nov 26, 2024

This problem comes and goes ... I think I have a workaround - and then it comes back. I'm going to have to get my hands dirty under the hood, inside the code, to try and track down the culprit. I have zero knowledge of Rust, and it's annoying having to get to grips with the language just to get some answers ...

@ArthurBrussee
Owner

Hey! Yes, so sorry about this - it's a really annoying error, as there's very little to go on :/ More people have reported it though, so it is definitely there.

The good news is that I have been trying to fix some memory corruption on Metal which I think MIGHT be related to this. See this, this and, believe it or not, this PR.

I'm still waiting on a Cube update, at which point there is a chance this will be fixed.

If that doesn't do it, it would be amazing if you could dive in, of course :) Worst case, you end up doing some fun Rust!

PS: Are you by any chance running on an integrated GPU?
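
For anyone wanting to check which GPU is actually being picked up, here's a small stand-alone sketch (not part of brush, and the exact wgpu::Instance::new signature varies a little between wgpu releases) that lists every adapter wgpu can see, along with its backend and device type (DiscreteGpu, IntegratedGpu, Cpu, ...):

// Illustrative only: enumerate the adapters wgpu reports on this machine.
fn main() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor::default());
    for adapter in instance.enumerate_adapters(wgpu::Backends::all()) {
        let info = adapter.get_info();
        println!("{:?} | {} | {:?}", info.backend, info.name, info.device_type);
    }
}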

@fasteinke
Author

fasteinke commented Nov 27, 2024

Thanks for getting back on this! ... Good to hear that 'salvation' might be in sight ...

The GPU is an old 4 GB card, sitting in an equally old HP workstation - with not enough compute power to use the Python stuff, which is why the brush solution will be excellent!!

Cheers!

@fasteinke
Author

Okay, the good news is that the latest round of changes, so far, seems to have solved the crashing from a divide by zero error.

The not-so-good news is that the dataset viewer's interface, and its linkage with the scene viewer, is now badly broken - something you may already be aware of. But I'm OK with that for now - far better that the training keeps processing!

@fasteinke
Author

Not-so-good news again ... if I push the processing a bit, by changing some hard-coded parameters so that more splats are produced per a certain number of iters, instability returns - the divide by zero, plus other crashes and panics. Is it because memory limits are being pushed? I'm not really sure; I don't see a pattern yet - there is no clear OOM failure barrier like the one you get, quite dependably, with the Python and C Gaussian splatting methods I've tried.

@ArthurBrussee
Owner

Shoot, that is really bad news :/ I had really hoped the memory corruption fixes had sorted it! I have just pushed another update which moves Burn past a version with another memory corruption bug. In theory that one didn't apply on Vulkan (the default on Windows), but let me know nevertheless.

As for the dataset + scene view, it's quite broken indeed, sorry about that! I've been busy moving house, but should hopefully be back to full speed soon to fix it.

@fasteinke
Author

I have just updated to the latest Burn version, and things seem to be better - fingers crossed! I'm trying to get a handle on when it crashes due to OOM, and this appears to be tied to GPU limits - I note that GPU memory usage goes up and up, and then suddenly drops dramatically; do we know what the conditions are for this? If it can be controlled, or anticipated, then more ambitious runs can be tried ...

The good news is that the actual splatting does better on a test case of mine than any other GS technique I've tried! The test case is handrails on the end of a building - thumbs up!

Hope the move went well; cheers,

Frank

@ArthurBrussee
Owner

Oh hah good to hear, maybe it does help :)

The memory usage can be a bit weird indeed. The allocator I implemented for Burn allocates pages, and frees a page in a pool when it hasn't been used in N allocations, with N depending on the size of the pool. When the number of splats varies wildly, many pools allocate pages and memory use goes up, until those pages become unused and a bunch get removed again.

This is definitely not a great strategy - when splat counts stay relatively stable it should be OK, but it doesn't sound like that works out well for you!
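
A rough sketch of that freeing policy, with hypothetical types and an invented threshold formula (this is not Burn's actual allocator): pages live in pools, every allocation bumps a counter, and a page is dropped once it has gone unused for the last N allocations.

// Hypothetical illustration of the pool/page strategy described above.
struct Page {
    id: u64,
    in_use: bool,
    last_used: u64, // allocation counter value when this page last served a request
}

struct Pool {
    page_size: u64,
    pages: Vec<Page>,
    alloc_count: u64,
    next_id: u64,
}

impl Pool {
    // N: larger pages stick around for more allocations before being freed.
    // The formula is purely illustrative.
    fn unused_threshold(&self) -> u64 {
        16 + self.page_size / (1 << 20)
    }

    fn allocate(&mut self) -> u64 {
        self.alloc_count += 1;
        let now = self.alloc_count;

        // Drop pages that have sat unused for more than N allocations. When the
        // splat count swings wildly, pools grow (memory climbs) and later shed
        // many pages at once - the sudden drop described above.
        let n = self.unused_threshold();
        self.pages.retain(|p| p.in_use || now - p.last_used <= n);

        // Reuse a free page if one exists, otherwise grow the pool.
        let idx = match self.pages.iter().position(|p| !p.in_use) {
            Some(i) => i,
            None => {
                self.next_id += 1;
                self.pages.push(Page { id: self.next_id, in_use: false, last_used: now });
                self.pages.len() - 1
            }
        };
        self.pages[idx].in_use = true;
        self.pages[idx].last_used = now;
        self.pages[idx].id
    }

    fn release(&mut self, id: u64) {
        if let Some(p) = self.pages.iter_mut().find(|p| p.id == id) {
            p.in_use = false;
        }
    }
}

fn main() {
    let mut pool = Pool { page_size: 64 << 20, pages: Vec::new(), alloc_count: 0, next_id: 0 };

    // A burst of simultaneous allocations grows the pool...
    let burst: Vec<u64> = (0..8).map(|_| pool.allocate()).collect();
    for id in burst {
        pool.release(id);
    }
    println!("pages after burst: {}", pool.pages.len());

    // ...and a long stretch of lower usage lets the idle pages be retired again.
    for _ in 0..1000 {
        let id = pool.allocate();
        pool.release(id);
    }
    println!("pages after settling: {}", pool.pages.len());
}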

To be clear - the OOM crash is a separate thing you're seeing, versus the integer divide? So far, when I've run out of memory my GPU just starts swapping memory in and out. Terribly slow, but not crashing.

In better news, I've fixed the scene <-> dataset view link!

Also wonderful to hear the quality is better :D It's generally still a bit behind, but getting there :)

@fasteinke
Author

Okay, I tried pushing the memory thing a bit just now - and this is what I got; pretty typical, in contrast to the integer divide-by-zero error:
...
Iter: 6151
Iter: 6226
Iter: 6301
thread 'tokio-runtime-worker' panicked at C:\Users\fstei\.cargo\git\checkouts\wgpu-f9afa33caa1e84c9\ffb4852\wgpu\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
Parent device is lost

note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at C:\Users\fstei\.cargo\git\checkouts\wgpu-f9afa33caa1e84c9\ffb4852\wgpu\src\backend\wgpu_core.rs:829:30:
Error in Surface::get_current_texture_view: Validation Error

Caused by:
Parent device is lost

error: process didn't exit successfully: target\debug\brush_bin.exe (exit code: 101)

The iter messages are from a println I inserted to keep track. I don't mind if there is mad memory swapping - so long as it keeps running!

Good to hear progress is happening - look forward to getting updates!

@fasteinke
Author

fasteinke commented Dec 3, 2024

Bad news ... the integer divide by zero crash is back with a vengeance - seemingly nothing to do with memory; it happens at step 190, with no debug info, under certain conditions! I will explore a bit more to try to narrow down precisely why the program gets unhappy ...

Latest source; also updated to Rust 1.83 from 1.82, just in case. No change in behaviour.

@fasteinke
Author

fasteinke commented Dec 3, 2024

It's getting interesting ... it looks like it's triggered by a 'just right' condition - the images are 5616 pixels wide; if the training resolution is set to 1404, which fits exactly (5616 / 1404 = 4), it crashes. If set to 1403, it continues running ...

Update!! Spoke too soon! The 1403 variant runs longer, but when I tried it a second time, it crashed some time later ...
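
For reference, the arithmetic behind the 'nicely fits' observation, as a stand-alone sketch (illustrative only - the actual crash site in brush isn't identified here): 5616 divides exactly by 1404 but not by 1403, so a downstream quantity that becomes zero only in the exact-fit case would, in native or GPU-side code, surface as STATUS_INTEGER_DIVIDE_BY_ZERO.

// Illustration only - not taken from the brush source.
fn main() {
    let src_width: u32 = 5616;
    for target in [1404u32, 1403] {
        let factor = src_width / target;    // integer division: 4 in both cases
        let remainder = src_width % target; // 0 only when the target fits exactly
        println!("target {target}: factor {factor}, remainder {remainder}");
    }
}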

@fasteinke
Author

fasteinke commented Dec 4, 2024

Okay, it's not magic numbers related to the image size - the latest testing indicates that the program has to "settle in". Meaning: do whatever it takes with the settings to get brush to continue processing happily with some image set, say into the thousands of steps, then pause, adjust the parameters and reload the dataset to what is actually desired. Note: don't exit and restart the program - it will only crash again.

@ArthurBrussee
Owner

If you don't mind - closing this in favour of #60!
