"Number of threads per block exceeds kernel limit" when computing a reduction on a KernelFunctionOperation on some GPUs #4047
Comments
I was not able to reproduce using just CUDA.jl. I'll try again later. To be fair, the issue could also be in GPUArrays.jl. Although I did not get the error after upgrading to CUDA.jl v5.6.1 even though CUDA.jl's But to upgrade to CUDA.jl v5.6.1 I had to change the GPUArrays.jl It could be that solving the scalar operations issue and upgrading to the latest CUDA.jl (#4036) solves this issue too.
Just wanted to make a comment here. I tried to drill into this a little and didn't make much progress. However, I think it does make sense that it depends on the precision and grid size. The reason is that I really do suspect that this has to be fixed within CUDA but I am not exactly sure how. One possibility though is to manually hard code Note this code is not a reproducer for me so I cannot test.
I'm confirming that as of the current So I guess technically this issue could be closed, although it seems to be cropping up in other situations, e.g. tests failing in #4139. And I have some non-MWEs I'd like to test.
Following #4034 I was able to reliably reproduce the error. When using a 16³ LatitudeLongitudeGrid with Float32 on an NVIDIA RTX 4090 with --check-bounds=yes (needed!), the MWE below produces the error below.

There is no error with a smaller 8³ grid. There is no error with Float64. And there is no error on a RectilinearGrid.

I was not able to reproduce on a V100. But I've seen this error show up when running simulations on an immersed LatitudeLongitudeGrid with Float64 on a V100 and H100 without --check-bounds=yes.

I will try to reproduce using just CUDA.jl. It's interesting that the error suggests that the RTX 4090 has a "Maximum number of threads per block" of 512 when CUDA deviceQuery says it's 1024.
MWE:
Error:
Environment: Oceananigans.jl main branch.
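On the 512 versus 1024 discrepancy mentioned in the issue: the limit CUDA reports for a specific compiled kernel can be lower than the device-wide limit that deviceQuery prints, because it accounts for the resources that kernel needs, such as registers per thread. The sketch below is only a back-of-the-envelope estimate of that effect, not a diagnosis of this bug. It assumes 65536 registers available per block, a device-wide cap of 1024 threads per block, and a warp size of 32; none of these numbers come from the issue, and `register_limited_threads` is a hypothetical helper, not part of CUDA.jl. The real driver may round differently because of register allocation granularity.

```julia
# Rough estimate of the per-kernel thread limit implied by register pressure.
# Assumptions (not from the issue): 65536 32-bit registers available per block,
# a device-wide cap of 1024 threads per block, and a warp size of 32.
# `register_limited_threads` is a hypothetical helper, not a CUDA.jl function.
function register_limited_threads(regs_per_thread::Integer;
                                  regs_per_block::Integer = 65536,
                                  device_max_threads::Integer = 1024,
                                  warp_size::Integer = 32)
    regs_per_thread > 0 || throw(ArgumentError("regs_per_thread must be positive"))
    # Threads that fit in the register budget, rounded down to whole warps.
    by_registers = (regs_per_block ÷ regs_per_thread) ÷ warp_size * warp_size
    # The device-wide cap still applies when registers are not the bottleneck.
    return min(device_max_threads, by_registers)
end

# Print the estimate for a few register counts per thread.
for regs in (32, 64, 65, 128)
    println(regs, " registers/thread -> at most ",
            register_limited_threads(regs), " threads/block")
end
```

Under these assumptions, a kernel that needs more than 64 registers per thread can no longer be launched with 1024 threads per block, which would be consistent with a per-kernel limit of 512 on a device whose global limit is 1024. It is only a guess that bounds checking increases register use enough to cause this here. On a machine with a GPU, the actual values can be compared by compiling the kernel with `@cuda launch=false` and then calling `CUDA.registers(kernel)` and `CUDA.maxthreads(kernel)`, next to `CUDA.attribute(CUDA.device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)`.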