Full system hang on Apple M1 8GB #548
Comments
I had the same issue when aggressively zooming in and out. No stuttering, but suddenly the whole system hangs. I am on an M2 Pro 32GB.
We'll look into it. A good way of isolating this is to turn on |
I ran into this with some scene content today. I'll investigate. |
Hi @jrmoulton. We think that #551 or #553 might have fixed this issue. Does this still happen for you on main? |
I'm able to repro this fairly straightforwardly on M1. There are two separate issues, both of which need to be fixed. The first is that zooming in causes unboundedly large memory usage, specifically in flatten. This is because it's flattening all outlines, using the transform to determine the number of subdivisions. What needs to happen is culling the flattened line segments to the viewport. That's also related to vello#542. The second problem is that when the "line soup" buffer overflows, that should be detected and all downstream work should early-out. This is what #553 was trying to address, but it seems that didn't catch every case. What should be happening is that test in the binning stage should see that I did a little validation of this, mostly observing a panic when using |
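The second point above (detect the overflow, then have all downstream work early-out) can be sketched on the CPU roughly as follows. This is only an illustration under stated assumptions: `BumpAlloc`, `alloc`, `has_failed` and `binning_stage` are hypothetical names invented for this sketch, not Vello's actual types or shader code.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for a GPU bump allocator over a fixed-size buffer.
/// Not Vello's actual data structure.
pub struct BumpAlloc {
    next: AtomicU32,   // next free slot in the buffer
    failed: AtomicU32, // non-zero once any allocation has overflowed
    capacity: u32,
}

impl BumpAlloc {
    pub fn new(capacity: u32) -> Self {
        BumpAlloc {
            next: AtomicU32::new(0),
            failed: AtomicU32::new(0),
            capacity,
        }
    }

    /// Reserve `count` slots. Returns the start offset, or `None` on overflow.
    /// On overflow the failure flag is set so later stages can see it.
    pub fn alloc(&self, count: u32) -> Option<u32> {
        let start = self.next.fetch_add(count, Ordering::Relaxed);
        match start.checked_add(count) {
            Some(end) if end <= self.capacity => Some(start),
            _ => {
                self.failed.store(1, Ordering::Relaxed);
                None
            }
        }
    }

    pub fn has_failed(&self) -> bool {
        self.failed.load(Ordering::Relaxed) != 0
    }
}

/// Hypothetical downstream stage: it does no work at all if an upstream
/// stage has already overflowed its buffer.
pub fn binning_stage(alloc: &BumpAlloc, lines: &[u32]) -> Option<u32> {
    if alloc.has_failed() {
        return None; // early-out: upstream data is incomplete
    }
    let total: u32 = lines.iter().sum();
    Some(total)
}
```

The design point is that the failure flag is sticky: once any writer overflows, every later stage that checks the flag skips its work instead of reading a partially written buffer.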
My M3 Pro Max 64GB doesn’t want to start correctly again after this. Can’t seem to get to logging in successfully. edit: third time was the charm |
A couple of updates; I'm digging into this. First, it does not seem to repro on v0.1.0. Second, I had originally suspected #537, but it repros in the parent of that. My current working hypothesis is that flatten itself is getting stuck. I'm starting to do some testing with all downstream shaders ablated, and haven't seen a full hard hang, but have seen the GPU get into a bad state. If this is the case, then it's likely that there's a reasonably quick patch to just limit the amount of subdivision in flatten. I'm also starting to wonder whether the best approach is going to be aggressive culling to the viewport in flatten; if nothing else, then that will be a major performance improvement in the highly zoomed in case. It's not clear to me why this would be triggered from floem, I'm wondering whether there's something invalid about the scene. Could you provide more detailed repro steps? |
I've locally tried applying this patch:
Rendering becomes very slow when zoomed in with big factors, but it doesn't cause a full system hang. That's possibly something to try with the floem use case. Another thing to try is turning on CPU shaders - I expect it to panic with an out-of-bounds when writing to LineSoup from flatten.
@DJMcNab noticed that in my usage of Vello in Floem I wasn't ever doing a scene reset. After adding a scene reset I no longer experience a hang. I don't know if that is considered resolving the issue, but it does unblock me from further integration with Vello.
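One plausible reading of why the missing reset mattered (an assumption, the thread does not spell it out) is that a scene which is appended to every frame without being cleared keeps growing, so each frame encodes more work than the last. A toy sketch, where `ToyScene` and `render_frames` are hypothetical and stand in for the real scene type:

```rust
/// Toy stand-in for a retained scene. Hypothetical; not Vello's `Scene`.
#[derive(Default)]
pub struct ToyScene {
    commands: Vec<u32>,
}

impl ToyScene {
    pub fn push(&mut self, cmd: u32) {
        self.commands.push(cmd);
    }

    /// Drop all previously encoded commands.
    pub fn reset(&mut self) {
        self.commands.clear();
    }

    pub fn len(&self) -> usize {
        self.commands.len()
    }
}

/// Encode ten commands per frame and return the scene size after `frames` frames.
pub fn render_frames(scene: &mut ToyScene, frames: usize, reset_each_frame: bool) -> usize {
    for _ in 0..frames {
        if reset_each_frame {
            scene.reset();
        }
        for cmd in 0..10 {
            scene.push(cmd);
        }
    }
    scene.len()
}
```

Without the reset, the scene after 100 frames holds 1000 commands; with it, the size stays at 10 regardless of how many frames have been drawn.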
As the scale factor becomes unreasonably large, the number of subdivisions of an Euler spiral into lines grows, making the flatten stage take too long. This patch just bounds the number of subdivisions, so it will eventually make progress. It will be slow, but not hang. A better solution would be to aggressively cull so it only generates geometry inside the viewport, but that is considerably more complicated. Workaround for #548
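The idea of bounding the subdivision count can be sketched as below. This is not the actual patch: the cap value, the function name `subdivision_count`, and the generic square-root error estimate are all assumptions made for illustration, and Vello's Euler spiral flattening uses its own formula.

```rust
/// Hypothetical cap; the limit used by the real patch is not stated in the thread.
pub const MAX_SUBDIVISIONS: u32 = 1024;

/// Estimate how many line segments are needed for a curve segment.
///
/// `unscaled_error` is the curve's deviation from its chord in local units,
/// `scale` is the transform's scale factor, and `tolerance` is the allowed
/// error in pixels. Assumes error falls off with the square of the segment
/// count, so n grows like sqrt(error / tolerance).
pub fn subdivision_count(unscaled_error: f64, scale: f64, tolerance: f64) -> u32 {
    let scaled = (unscaled_error * scale).abs();
    let n = (scaled / tolerance).sqrt().ceil();
    // NaN and values below 1 fall back to a single segment.
    if !(n >= 1.0) {
        return 1;
    }
    // Without this clamp, n is unbounded as `scale` grows.
    if n >= MAX_SUBDIVISIONS as f64 {
        MAX_SUBDIVISIONS
    } else {
        n as u32
    }
}
```

With the clamp in place, an extreme scale factor produces coarse geometry rather than an ever-growing loop, which matches the "slow, but not hang" behaviour described above.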
The problem still exists.
@XdaTk can you please provide reproduction steps for what you're seeing? |
It got stuck and nothing responded except the mouse. I had to restart it. |
Oh, I didn't realise that your machine isn't an M1 machine, as you indicated it was by posting that in this issue. Would you mind creating a new issue for the behaviour you're seeing? As a starting point, could you please determine which GPU is seeing this crash. |
I'm still getting the full system hang on
Can you please confirm which commit you're using @sfjohnson? We have had a memory leak issue which was solved today (#661), which I could see causing this kind of issue. |
It was the latest 59c0fa5 with the fix applied. |
Can you determine whether this was a regression? If so, which commit was it introduced in? We have several developers on M1 family chips, so your experience is surprising to me.
Found it! It's 1daf2a4. If I apply |
Hmm, that's concerning. Do you think you can extract the full MSL for the relevant shader with and without that setting? Is it |
Looks like it's the same MSL regardless of |
Hmm, the same MSL being generated doesn't track with my expectations. The only thing that |
I double checked and same result. I'm not sure if I collected the MSL correctly though, I did:
Is that right? |
Those aren't the shaders being generated by wgpu - those are shaders generated by vello_shaders for third-party users of our shaders. You wouldn't need to change any features to get the shaders from wgpu, although unfortunately I don't know the best way. I think it might involve either adding debug prints inside wgpu, or using your system's GPU debugging tools. |
Ok I think this makes more sense now, I am logging from inside wgpu. The diff is (hangs when not present):
Full sources: |
I realise that this will be quite hard to do, but do you think you could isolate which of those is required? The easiest one to validate would be the barrier, because you can just add a Thanks so much for being so patient with debugging this so far! |
It seems to hang unless everything is cleared, and I had to add some extra barriers. Here's what I have working, added to the start of
Note that this might not be completely optimal as I've never written WGSL before and I'm trying to minimise subjecting my computer to lots of hard reboots. Fortunately it seems this is all that is required; all other shaders work without zero initialisation. |
That clearing routine is UB. The pattern you actually want in this case is: buffer[local_id.x] = 0; for each buffer, and not in a loop. I wonder if the buffers start in a poison state, so Metal now decides that it can just do UB?
Oh I see, like this right? (removed one barrier and it still works):
I'm desk-checking the code now to see if there's any uninitialized read. Would it be possible to isolate which of these initializations is responsible? Also, the pattern of initializing sh_bitmaps is way less efficient than it could be (though not undefined behavior, as the store is atomic). A better pattern is the initialization on lines 205-207. |
Hmm, unfortunately while trying to isolate each initialisation things stopped being predictable. Now the code I posted above sometimes causes a hang. It looks like the bug might not actually be isolated to |
I'm also quite willing to dig into this myself, but it's unclear how to repro. Just so I understand, it's failing just running the default scene, nothing special? That certainly works on my machine (M1 Pro, 14.2.1). It's certainly possible that there's an uninitialized memory read elsewhere in the pipeline, that was getting masked by the zeroing. |
I just double checked and yeah it's super easy to repro for me just by cloning the repo and running |
I am also running into this problem on an M2 mac. In my case:
This happens consistently on version 0.2.1. The behavior from the video is happening here: https://www.cocube.com/console. Happy to work with you to fix this (it's not great UX to crash someone's computer from your website) and I'd like to stick with vello. I've looked over the code for the shaders and have a decent enough high-level understanding to try making some changes, but I could still use some guidance.
@94bryanr Just for extra info, how much memory does your M2 Mac have? |
I have a hypothesis: this might be uninitialized memory read of workgroup shared memory. That would be consistent with zeroing the memory mitigating the problem, and would also explain why it manifests after long running time - it may be a low probability that a particular value causes an infinite loop. It's somewhat frustrating, because decent tooling could help catch it, but we don't have that. A couple things can be done. One is to carefully desk check the shaders for UMR (I looked over coarse, didn't find anything, but I could have missed something, and it might be a different shader). Another is to deliberately inject garbage initial values (3735928559u etc) and see if that changes behavior. Another pathway is to get a repro case I can observe. The application of @94bryanr sounds promising if we can get that to happen on my machine. It would be really good to get this tracked down. |
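The "inject garbage initial values" idea can be illustrated on the CPU like this. It is a minimal sketch, not a tool for real shaders: `buggy_kernel` is a hypothetical kernel with a deliberate uninitialized read, and `reads_uninitialized` is a made-up helper that compares a zero-filled run against a poison-filled run.

```rust
/// 3735928559 == 0xDEADBEEF, the garbage value mentioned in the thread.
pub const POISON: u32 = 0xDEAD_BEEF;

/// Hypothetical kernel with a deliberate bug: it initializes only the first
/// `n` slots of its scratch memory but then reads every slot.
pub fn buggy_kernel(shared: &mut [u32], n: usize) -> u32 {
    for slot in shared.iter_mut().take(n) {
        *slot = 1;
    }
    shared.iter().fold(0u32, |acc, &v| acc.wrapping_add(v))
}

/// Run `kernel` once on zeroed scratch memory and once on poisoned scratch
/// memory. If the results differ, the kernel read a slot it never wrote.
pub fn reads_uninitialized(kernel: impl Fn(&mut [u32]) -> u32, len: usize) -> bool {
    let mut zeroed = vec![0u32; len];
    let mut poisoned = vec![POISON; len];
    kernel(&mut zeroed) != kernel(&mut poisoned)
}
```

A kernel that writes all of its scratch memory before reading gives the same answer either way; one that skips some slots does not, which is how zero-filling can mask such a bug. Identical results do not prove correctness, since a poisoned read might not affect the final output.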
I recently upgraded from macOS 12 to 14 and now the issue is gone, even when zooming in close multiple times on |
Thanks for that report. I'm glad to hear it. This is the third report we've received of this kind of hang happening on macOS 12, and the second of it being fixed after an update of macOS. I don't think we can meaningfully take any action here. @XdaTk, please update your macOS version, but I'm going to close this on the assumption that would fix this. We can always re-open if that hypothesis is wrong. |
My M2 Mac Pro is the 32GB version and it is running MacOS Ventura 13.6.4. |
The problem persists even after updating to MacOS Sonoma. Just to recap I am experiencing the issue on an M2 Mac 32GB while using vello 0.2.1. The problem happens on MacOS Ventura and on a fully updated MacOS Sonoma. The issue does not happen on Windows. For most people experiencing this it sounds like the issue is related to zooming in and out but for me the issue only happens when the resolution of the rendering context (in my case an HTML canvas) is increased to nearly 4k. I haven't been able to test the main branch yet since it looks like most of the wgpu types are now re-exported under vello::wgpu, which required more refactoring than I was able to get done at the time (without sidetracking too much please reconsider re-exporting those types as it makes vello take over the entire wgpu pipeline. I need to change all of my wgpu::GPU and wgpu::Device etc to vello::wgpu::*.). I'm going to take another look at this over this though. @raphlinus You should be able to at least repro in the browser at https://www.cocube.com/console if you stretch the window to 4k and try scrolling up and down, but I'm not sure how valuable that will be. And thanks for the amazing work on vello on this so far - very excited about the future of the project! Update: It seems like the display freezing and requiring a reboot is no longer happening on the updated MacOS version, but I am still seeing the visual artifact of nothing rendering below a certain line, with the line rising the more the resolution is expanded. |
That problem reproduces for me on Linux, but my assumption is that it's another instance of #366. |
Yes, I suspect that is probably one of the drive-by fixes I have done in #606, give me half an hour to make a small PR fixing it. That is,
I don't understand what you're saying here, sorry. Our If the hangs are not happening, then that vindicates the decision not to close this issue. |
I get an immediate hang when I run the Xilem to_do_mvc demo, on an 8GB M2 with MacOS 13.6.7. |
@danielkeller as discussed elsewhere in this thread, updating macOS should fix this issue. We are not planning on resolving this for older macOS versions anytime soon. |
I'm experimenting with swapping in Vello as the renderer for floem and I'm running into an issue where, when using Vello,
I get this issue in the with_winit example in the Vello examples and also in the editor example in floem.
In the with_winit example it will happen when I zoom in too far into the Ghostscript tiger. I have noticed that there is a limit to the amount that I can zoom in and this causes some stuttering, but this is separate from when it becomes unresponsive.
In the editor example the issue is caused when I delete several characters from the starting text in the editor.
Prior to hanging, macOS Activity Monitor doesn't indicate high memory pressure.
The issue isn't consistent (and reproducing takes a long time) but it does happen regularly (within 30 seconds of doing the above listed actions).
Apple M1 MacBook Air
8GB RAM
Sonoma 14.2.1 (23C71)