"Number of threads per block exceeds kernel limit" when computing a reduction on a `KernelFunctionOperation` on some GPUs #4047

ali-ramadhan · 2025-01-17T03:31:15Z

Following #4034, I was able to reliably reproduce the error. With a 16³ LatitudeLongitudeGrid, Float32, an NVIDIA RTX 4090, and --check-bounds=yes (needed!), the MWE below fails with the error shown after it.

There is no error with a smaller 8³ grid. There is no error with Float64. And there is no error on a RectilinearGrid.

I was not able to reproduce it on a V100. However, I have seen this error show up when running simulations on an immersed LatitudeLongitudeGrid with Float64 on both a V100 and an H100, without --check-bounds=yes.

I will try to reproduce using just CUDA.jl. It's interesting that the error suggests that the RTX 4090 has a "Maximum number of threads per block" of 512 when CUDA deviceQuery says it's 1024.
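
For context, the 512 in the error message is labelled a kernel limit rather than a device limit: CUDA reports a per-kernel maximum block size that depends on the compiled kernel's resource use and can be lower than the device-wide maximum. A minimal sketch of how the two numbers could be compared in CUDA.jl follows. It assumes a working CUDA.jl install and a GPU, and `scale_kernel!` is a hypothetical stand-in kernel, not the reduction kernel from CUDA.jl or anything in Oceananigans.

```julia
using CUDA

# Hypothetical stand-in kernel (an assumption for illustration only).
function scale_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 4096)
y = similar(x)

# Compile without launching so the kernel's own limits can be inspected.
kernel = @cuda launch=false scale_kernel!(y, x, 2f0)

# Device-wide maximum (the number deviceQuery reports).
device_limit = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

# Per-kernel maximum and register use of this particular compiled kernel.
kernel_limit = CUDA.maxthreads(kernel)
registers    = CUDA.registers(kernel)

@show device_limit kernel_limit registers

# Let the occupancy API suggest a block size that respects the kernel limit.
config  = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks  = cld(length(y), threads)
kernel(y, x, 2f0; threads, blocks)
```

For a kernel this small the two limits will most likely coincide; a mismatch like the one in the error would only be expected for a heavier kernel.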

MWE:

using Oceananigans
using Oceananigans.Advection: cell_advection_timescaleᶜᶜᶜ

grid = LatitudeLongitudeGrid(GPU(), Float32;
    topology = (Bounded, Bounded, Bounded),
    size = (16, 16, 16),
    longitude = (-10, 10),
    latitude = (-10, 10),
    z = (-100, 0)
)

model = HydrostaticFreeSurfaceModel(; grid)

u, v, w = model.velocities
τ = KernelFunctionOperation{Center, Center, Center}(cell_advection_timescaleᶜᶜᶜ, grid, u, v, w)

τ_min = minimum(τ)

Error:

ERROR: Number of threads per block exceeds kernel limit (640 > 512).
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] diagnose_launch_failure(f::CUDA.CuFunction, err::CUDA.CuError; blockdim::CUDA.CuDim3, threaddim::CUDA.CuDim3, shmem::Int64)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:120
  [3] launch(::CUDA.CuFunction, ::CUDA.KernelState, ::CartesianIndices{…}, ::CartesianIndices{…}, ::CUDA.CuDeviceArray{…}, ::KernelFunctionOperation{…}; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:73
  [4] launch
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:52 [inlined]
  [5] #972
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:189 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:149 [inlined]
  [7] macro expansion
    @ ./none:0 [inlined]
  [8] convert_arguments
    @ ./none:0 [inlined]
  [9] #cudacall#971
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:191 [inlined]
 [10] cudacall
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:187 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:279 [inlined]
 [12] macro expansion
    @ ./none:0 [inlined]
 [13] (::CUDA.HostKernel{…})(::typeof(identity), ::typeof(min), ::Nothing, ::CartesianIndices{…}, ::CartesianIndices{…}, ::Val{…}, ::CUDA.CuDeviceArray{…}, ::KernelFunctionOperation{…}; convert::Val{…}, call_kwargs::@Kwargs{…})
    @ CUDA ./none:0
 [14] AbstractKernel
    @ ./none:0 [inlined]
 [15] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:114 [inlined]
 [16] mapreducedim!(f::typeof(identity), op::typeof(min), R::SubArray{…}, A::KernelFunctionOperation{…}; init::Nothing)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/mapreduce.jl:271
 [17] mapreducedim!(f::typeof(identity), op::typeof(min), R::SubArray{…}, A::KernelFunctionOperation{…})
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/mapreduce.jl:169
 [18] mapreducedim!(f::Function, op::Function, R::SubArray{…}, A::KernelFunctionOperation{…})
    @ GPUArrays ~/.julia/packages/GPUArrays/qt4ax/src/host/mapreduce.jl:10
 [19] minimum!(f::Function, r::SubArray{…}, A::KernelFunctionOperation{…}; init::Bool)
    @ Base ./reducedim.jl:1036
 [20] minimum!(f::Function, r::Field{…}, a::KernelFunctionOperation{…}; condition::Nothing, mask::Float64, kwargs::@Kwargs{…})
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:676
 [21] minimum(f::Function, c::KernelFunctionOperation{Center, Center, Center, LatitudeLongitudeGrid{…}, Float32, typeof(cell_advection_timescaleᶜᶜᶜ), Tuple{…}}; condition::Nothing, mask::Float64, dims::Function)
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:706
 [22] minimum
    @ ~/atdepth/Oceananigans.jl/src/Fields/field.jl:695 [inlined]
 [23] minimum(c::KernelFunctionOperation{Center, Center, Center, LatitudeLongitudeGrid{…}, Float32, typeof(cell_advection_timescaleᶜᶜᶜ), Tuple{…}})
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:715
 [24] top-level scope
    @ REPL[7]:1

caused by: CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/libcuda.jl:30
  [2] check
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/libcuda.jl:37 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/2kjXI/lib/utils/call.jl:34 [inlined]
  [4] (::CUDA.var"#966#967"{Bool, Int64, CUDA.CuStream, CUDA.CuFunction, CUDA.CuDim3, CUDA.CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:66
  [5] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:33 [inlined]
  [6] macro expansion
    @ ./none:0 [inlined]
  [7] pack_arguments(::CUDA.var"#966#967"{…}, ::CUDA.KernelState, ::CartesianIndices{…}, ::CartesianIndices{…}, ::CUDA.CuDeviceArray{…}, ::KernelFunctionOperation{…})
    @ CUDA ./none:0
  [8] launch(::CUDA.CuFunction, ::CUDA.KernelState, ::CartesianIndices{…}, ::CartesianIndices{…}, ::CUDA.CuDeviceArray{…}, ::KernelFunctionOperation{…}; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:59
  [9] launch
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:52 [inlined]
 [10] #972
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:189 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:149 [inlined]
 [12] macro expansion
    @ ./none:0 [inlined]
 [13] convert_arguments
    @ ./none:0 [inlined]
 [14] #cudacall#971
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:191 [inlined]
 [15] cudacall
    @ ~/.julia/packages/CUDA/2kjXI/lib/cudadrv/execution.jl:187 [inlined]
 [16] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:279 [inlined]
 [17] macro expansion
    @ ./none:0 [inlined]
 [18] (::CUDA.HostKernel{…})(::typeof(identity), ::typeof(min), ::Nothing, ::CartesianIndices{…}, ::CartesianIndices{…}, ::Val{…}, ::CUDA.CuDeviceArray{…}, ::KernelFunctionOperation{…}; convert::Val{…}, call_kwargs::@Kwargs{…})
    @ CUDA ./none:0
 [19] AbstractKernel
    @ ./none:0 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/CUDA/2kjXI/src/compiler/execution.jl:114 [inlined]
 [21] mapreducedim!(f::typeof(identity), op::typeof(min), R::SubArray{…}, A::KernelFunctionOperation{…}; init::Nothing)
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/mapreduce.jl:271
 [22] mapreducedim!(f::typeof(identity), op::typeof(min), R::SubArray{…}, A::KernelFunctionOperation{…})
    @ CUDA ~/.julia/packages/CUDA/2kjXI/src/mapreduce.jl:169
 [23] mapreducedim!(f::Function, op::Function, R::SubArray{…}, A::KernelFunctionOperation{…})
    @ GPUArrays ~/.julia/packages/GPUArrays/qt4ax/src/host/mapreduce.jl:10
 [24] minimum!(f::Function, r::SubArray{…}, A::KernelFunctionOperation{…}; init::Bool)
    @ Base ./reducedim.jl:1036
 [25] minimum!(f::Function, r::Field{…}, a::KernelFunctionOperation{…}; condition::Nothing, mask::Float64, kwargs::@Kwargs{…})
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:676
 [26] minimum(f::Function, c::KernelFunctionOperation{Center, Center, Center, LatitudeLongitudeGrid{…}, Float32, typeof(cell_advection_timescaleᶜᶜᶜ), Tuple{…}}; condition::Nothing, mask::Float64, dims::Function)
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:706
 [27] minimum
    @ ~/atdepth/Oceananigans.jl/src/Fields/field.jl:695 [inlined]
 [28] minimum(c::KernelFunctionOperation{Center, Center, Center, LatitudeLongitudeGrid{…}, Float32, typeof(cell_advection_timescaleᶜᶜᶜ), Tuple{…}})
    @ Oceananigans.Fields ~/atdepth/Oceananigans.jl/src/Fields/field.jl:715
 [29] top-level scope
    @ REPL[7]:1
Some type information was truncated. Use `show(err)` to see complete types.

Environment: Oceananigans.jl main branch.

julia> versioninfo()
Julia Version 1.10.7
Commit 4976d05258e (2024-11-26 15:57 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × AMD Ryzen Threadripper 7960X 24-Cores
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 16 default, 0 interactive, 8 GC (on 48 virtual cores)
Environment:
  LD_PRELOAD = /usr/NX/lib/libnxegl.so

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 565.77.0

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+565.77

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.7
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 19.505 GiB / 23.988 GiB available)


ali-ramadhan · 2025-01-17T04:33:51Z

I was not able to reproduce using just CUDA.jl. I'll try again later. To be fair, the issue could also be in GPUArrays.jl.

However, I did not get the error after upgrading to CUDA.jl v5.6.1, even though CUDA.jl's mapreducedim! (https://github.com/JuliaGPU/CUDA.jl/blob/d07a24572814efef6691005bd33ae7fd5f978f49/src/mapreduce.jl#L271) still used threads = 640 (according to print-statement debugging).

To upgrade to CUDA.jl v5.6.1, though, I had to change the GPUArrays.jl [compat] entry in Project.toml and manually disable scalar operations. Maybe those changes are what keep the error from showing up.

It could be that solving the scalar operations issue and upgrading to the latest CUDA.jl (#4036) solves this issue too.

glwagner · 2025-03-05T03:49:51Z

Just wanted to make a comment here. I tried to drill into this a little and didn't make much progress. However, I think it does make sense that the error depends on the precision and grid size. The reason is that CUDA.mapreducedim! makes use of shared memory, and the amount of shared memory a kernel needs depends on the precision. Also, I think that CUDA may limit the thread count for certain kernels depending on their shared memory usage, which would explain hitting a limit that is apparently lower than the maximum for the device.
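
To make the idea of a resource-dependent limit concrete, here is a back-of-the-envelope sketch. It is not how CUDA actually computes the kernel limit. The function `block_size_bound`, the per-block budgets (65,536 registers, 48 KiB of shared memory), and the per-thread numbers passed in are all illustrative assumptions, and real hardware allocates resources at a coarser granularity than this arithmetic suggests.

```julia
# Rough upper bound on threads per block given per-thread resource use.
# Every default below is an assumed budget, not a value queried from a device.
function block_size_bound(; registers_per_thread, shmem_bytes_per_thread,
                            registers_per_block = 65_536,
                            shmem_bytes_per_block = 48 * 1024,
                            device_max_threads = 1024,
                            warp_size = 32)
    # Threads that fit in the register budget.
    by_registers = registers_per_block ÷ registers_per_thread

    # Threads that fit in the shared memory budget (unbounded if none is used).
    by_shmem = device_max_threads
    if shmem_bytes_per_thread > 0
        by_shmem = shmem_bytes_per_block ÷ shmem_bytes_per_thread
    end

    bound = min(by_registers, by_shmem, device_max_threads)
    return (bound ÷ warp_size) * warp_size  # round down to whole warps
end

# A light kernel with one Float32 of shared memory per thread: the device limit applies.
light = block_size_bound(registers_per_thread = 40,
                         shmem_bytes_per_thread = sizeof(Float32))   # 1024

# A hypothetical heavy kernel using 128 registers per thread: the bound halves.
heavy = block_size_bound(registers_per_thread = 128,
                         shmem_bytes_per_thread = sizeof(Float32))   # 512
```

Under these assumed numbers a request for 640 threads would exceed the heavy kernel's bound of 512 while staying below the device-wide 1024, which is the shape of the error reported above. Which resource is actually responsible in this issue is not established in the thread.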

I really do suspect that this has to be fixed within CUDA, but I am not exactly sure how. One possibility, though, is to manually hard-code compute_threads here:

https://github.com/JuliaGPU/CUDA.jl/blob/1a3669d2ae2a36a52a4dc20d82d7e686b1fb2d45/src/mapreduce.jl#L221-L227

Note that the MWE does not reproduce the error for me, so I cannot test this.

ali-ramadhan · 2025-03-10T13:37:36Z

I'm confirming that as of the current main branch (commit ad0236000) the MWE in the original post works correctly now: it does not error and returns Inf32 as expected.

So I guess technically this issue could be closed, although the error seems to be cropping up in other situations, e.g. the failing tests in #4139. I also have some non-MWEs I'd like to test.

julia> versioninfo()
Julia Version 1.10.8
Commit 4c16ff44be8 (2025-01-22 10:06 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × AMD Ryzen Threadripper 7960X 24-Cores
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 16 default, 0 interactive, 8 GC (on 48 virtual cores)
Environment:
  LD_PRELOAD = /usr/NX/lib/libnxegl.so

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.8
NVIDIA driver 570.86.16

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+570.86.16

Julia packages: 
- CUDA: 5.6.1
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.10.8
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 20.309 GiB / 23.988 GiB available)

ali-ramadhan added the bug 🐞 and GPU 👾 labels Jan 17, 2025

ali-ramadhan mentioned this issue Jan 17, 2025: Scalar indexing with CUDA 5.6.0 #4036 (Closed)

mncrowe mentioned this issue Feb 15, 2025: Issue with TimeStepWizard on GPU() #4105 (Open)

ali-ramadhan mentioned this issue Mar 4, 2025: Tests on tartarus #4139 (Open)
