Crashing with STATUS_INTEGER_DIVIDE_BY_ZERO error #50

Closed
fasteinke opened this issue Nov 24, 2024 · 13 comments

Comments

@fasteinke

When I first started trying brush, it crashed on odd occasions. Now that I'm starting to try my own mods, and with the latest download, it's getting much worse ...

On Windows 10, using Rust 1.82.0: training the hotdog demo zip often gets only a few hundred steps in, then crashes with the above error in the console window after a "cargo run". It's completely random and non-repeatable, and changing parameters doesn't seem to matter.
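
For context, here's a minimal stand-alone sketch (illustrative only, not part of brush; the binary path is taken from the cargo output later in this thread) of how the two failure modes look from the outside. STATUS_INTEGER_DIVIDE_BY_ZERO is the NTSTATUS value 0xC0000094 raised by a native integer division by zero, whereas a Rust panic ends the process with exit code 101 - and safe Rust integer division by zero always panics, so a raw 0xC0000094 usually points at native, driver, or GPU-side code rather than a checked Rust division.

// Sketch only: launch the brush debug binary and report how it terminated.
use std::process::Command;

fn main() {
    let status = Command::new(r"target\debug\brush_bin.exe")
        .status()
        .expect("failed to launch brush");

    match status.code() {
        // NTSTATUS 0xC0000094 = STATUS_INTEGER_DIVIDE_BY_ZERO.
        Some(code) if code as u32 == 0xC000_0094 => {
            println!("terminated by a native integer divide-by-zero");
        }
        // Rust's default panic handler aborts the process with exit code 101.
        Some(101) => println!("terminated by a Rust panic"),
        other => println!("exited with {:?}", other),
    }
}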

@fasteinke
Author

fasteinke commented Nov 26, 2024

This problem comes and goes ... I think I have a workaround - and then it comes back. I'm going to have to get my hands dirty under the hood, inside the code, to try and track down the culprit. I have zero knowledge of Rust, and it's annoying having to get to grips with the language just to get some answers ...

@ArthurBrussee
Owner

Hey! Yes, so sorry about this - it's a really annoying error, as there's very little to go on :/ More people have reported it though, so it is definitely there.

The good news is that I have been trying to fix some memory corruption on Metal which I think MIGHT be related to this. See this, this and, believe it or not, this PR.

I'm still waiting on a Cube update, at which point there is a chance this will be fixed.

If that doesn't do it, it would be amazing if you could dive in, of course :) Worst case, you end up doing some fun Rust!

PS: Are you by any chance running on an integrated GPU?
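
For anyone wanting to check which GPU is actually being picked up, here's a small stand-alone sketch (not part of brush, and the exact wgpu::Instance::new signature varies a little between wgpu releases) that lists every adapter wgpu can see, along with its backend and device type (DiscreteGpu, IntegratedGpu, Cpu, ...):

// Illustrative only: enumerate the adapters wgpu reports on this machine.
fn main() {
    let instance = wgpu::Instance::new(wgpu::InstanceDescriptor::default());
    for adapter in instance.enumerate_adapters(wgpu::Backends::all()) {
        let info = adapter.get_info();
        println!("{:?} | {} | {:?}", info.backend, info.name, info.device_type);
    }
}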

@fasteinke
Author

fasteinke commented Nov 27, 2024

Thanks for getting back on this! ... Good to hear that 'salvation' might be in sight ...

The GPU is an old 4 GB card, sitting in an equally old HP workstation - with not enough compute power to use the Python stuff, which is why the brush solution will be excellent!!

Cheers!

@fasteinke
Author

Okay, the good news is that the latest round of changes, so far, seems to have solved the crashing from a divide by zero error.

The not-so-good news is that the dataset viewer's interface, and its linkage with the scene viewer, is now badly broken - something you may already be aware of. But I'm OK with that for now - far better that the training keeps processing!

@fasteinke
Author

Not-so-good news again ... if I push the processing a bit, by changing some hard-coded parameters so that more splats are produced per a certain number of iters, instability returns - the divide by zero, plus other crashes and panics. Is it because memory limits are being pushed? I'm not really sure; I don't see a pattern yet - there is no clear OOM failure barrier like the one you get, quite dependably, with the Python and C Gaussian splatting methods I've tried.

@ArthurBrussee
Owner

Shoot, that is really bad news :/ I had really hoped the memory corruption fixes had sorted it! I have just pushed another update which moves Burn past a version with another memory corruption bug. In theory that one didn't apply on Vulkan (the default on Windows), but let me know nevertheless.

As for the dataset + scene view, it's quite broken indeed, sorry about that! I've been busy moving house, but should hopefully be back to full speed soon to fix it.

@fasteinke
Author

I have just updated to the latest Burn version, and things seem to be better - fingers crossed! I'm trying to get a handle on when it crashes due to OOM, and this appears to be tied to GPU limits - I note that GPU memory usage goes up and up, and then suddenly drops dramatically; do we know what the conditions are for this? If it can be controlled, or anticipated, then more ambitious runs can be tried ...

The good news is that the actual splatting does better on a test case of mine than any other GS technique I've tried! The test case is handrails on the end of a building - thumbs up!

Hope the move went well; cheers,

Frank

@ArthurBrussee
Owner

Oh hah good to hear, maybe it does help :)

The memory usage can be a bit weird indeed. The allocator I implemented for Burn allocates pages, and frees a page in a pool when it hasn't been used in N allocations, with N depending on the size of the pool. When the number of splats varies wildly, many pools allocate pages and memory use goes up, until those pages become unused and a bunch get removed again.

This is definitely not a great strategy - when splat counts stay relatively stable it should be OK, but it doesn't sound like that works out well for you!
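
A rough sketch of that freeing policy, with hypothetical types and an invented threshold formula (this is not Burn's actual allocator): pages live in pools, every allocation bumps a counter, and a page is dropped once it has gone unused for the last N allocations.

// Hypothetical illustration of the pool/page strategy described above.
struct Page {
    id: u64,
    in_use: bool,
    last_used: u64, // allocation counter value when this page last served a request
}

struct Pool {
    page_size: u64,
    pages: Vec<Page>,
    alloc_count: u64,
    next_id: u64,
}

impl Pool {
    // N: larger pages stick around for more allocations before being freed.
    // The formula is purely illustrative.
    fn unused_threshold(&self) -> u64 {
        16 + self.page_size / (1 << 20)
    }

    fn allocate(&mut self) -> u64 {
        self.alloc_count += 1;
        let now = self.alloc_count;

        // Drop pages that have sat unused for more than N allocations. When the
        // splat count swings wildly, pools grow (memory climbs) and later shed
        // many pages at once - the sudden drop described above.
        let n = self.unused_threshold();
        self.pages.retain(|p| p.in_use || now - p.last_used <= n);

        // Reuse a free page if one exists, otherwise grow the pool.
        let idx = match self.pages.iter().position(|p| !p.in_use) {
            Some(i) => i,
            None => {
                self.next_id += 1;
                self.pages.push(Page { id: self.next_id, in_use: false, last_used: now });
                self.pages.len() - 1
            }
        };
        self.pages[idx].in_use = true;
        self.pages[idx].last_used = now;
        self.pages[idx].id
    }

    fn release(&mut self, id: u64) {
        if let Some(p) = self.pages.iter_mut().find(|p| p.id == id) {
            p.in_use = false;
        }
    }
}

fn main() {
    let mut pool = Pool { page_size: 64 << 20, pages: Vec::new(), alloc_count: 0, next_id: 0 };

    // A burst of simultaneous allocations grows the pool...
    let burst: Vec<u64> = (0..8).map(|_| pool.allocate()).collect();
    for id in burst {
        pool.release(id);
    }
    println!("pages after burst: {}", pool.pages.len());

    // ...and a long stretch of lower usage lets the idle pages be retired again.
    for _ in 0..1000 {
        let id = pool.allocate();
        pool.release(id);
    }
    println!("pages after settling: {}", pool.pages.len());
}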

To be clear - the OOM crash is a separate thing you're seeing, versus the integer divide? So far, when I've run out of memory my GPU just starts swapping memory in and out. Terribly slow, but not crashing.

In better news, I've fixed the scene <-> dataset view link!

Also wonderful to hear the quality is better :D It's generally still a bit behind, but getting there :)

@fasteinke
Author

Okay, I tried pushing the memory thing a bit just now - and this is what I got; pretty typical, in contrast to the integer divide-by-zero error:
...
Iter: 6151
Iter: 6226
Iter: 6301
thread 'tokio-runtime-worker' panicked at C:\Users\fstei\.cargo\git\checkouts\wgpu-f9afa33caa1e84c9\ffb4852\wgpu\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
Parent device is lost

note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
thread 'main' panicked at C:\Users\fstei\.cargo\git\checkouts\wgpu-f9afa33caa1e84c9\ffb4852\wgpu\src\backend\wgpu_core.rs:829:30:
Error in Surface::get_current_texture_view: Validation Error

Caused by:
Parent device is lost

error: process didn't exit successfully: target\debug\brush_bin.exe (exit code: 101)

The iter messages are from a println I inserted to keep track. I don't mind if there is mad memory swapping - so long as it keeps running!

Good to hear progress is happening - look forward to getting updates!

@fasteinke
Author

fasteinke commented Dec 3, 2024

Bad news ... the integer divide by zero crash is back with a vengeance - seemingly nothing to do with memory; it happens at step 190, with no debug info, under certain conditions! I will explore a bit more to try to narrow down precisely why the program gets unhappy ...

Latest source; also updated to Rust 1.83 from 1.82, just in case. No change in behaviour.

@fasteinke
Author

fasteinke commented Dec 3, 2024

It's getting interesting ... it looks like it's triggered by a 'just right' condition - the images are 5616 pixels wide; if the training resolution is set to 1404, which fits exactly (5616 / 1404 = 4), it crashes. If set to 1403, it continues running ...

Update!! Spoke too soon! The 1403 variant runs longer, but when I tried it a second time, it crashed some time later ...
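
For reference, the arithmetic behind the 'nicely fits' observation, as a stand-alone sketch (illustrative only - the actual crash site in brush isn't identified here): 5616 divides exactly by 1404 but not by 1403, so a downstream quantity that becomes zero only in the exact-fit case would, in native or GPU-side code, surface as STATUS_INTEGER_DIVIDE_BY_ZERO.

// Illustration only - not taken from the brush source.
fn main() {
    let src_width: u32 = 5616;
    for target in [1404u32, 1403] {
        let factor = src_width / target;    // integer division: 4 in both cases
        let remainder = src_width % target; // 0 only when the target fits exactly
        println!("target {target}: factor {factor}, remainder {remainder}");
    }
}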

@fasteinke
Author

fasteinke commented Dec 4, 2024

Okay, it's not magic numbers related to the image size - the latest testing indicates that the program has to "settle in". Meaning: do whatever it takes with the settings to get brush to continue processing happily with some image set, say into the thousands of steps, then pause, adjust the parameters and reload the dataset to what is actually desired. Note: don't exit and restart the program - it will only crash again.

@ArthurBrussee
Owner

If you don't mind - closing this in favour of #60!
