Out of bounds memory access in `Cuda.test_batched_getrs_t_double` unit test (H100, Cuda 12.0.0) #2485

cwpearson · 2025-01-21T17:29:51Z

Presumably introduced in #2483

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0

$ nvidia-smi
Tue Jan 21 10:29:38 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:27:00.0 Off |                    0 |
| N/A   36C    P0              48W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:38:00.0 Off |                    0 |
| N/A   36C    P0              48W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Reproducer @ sandia:

ssh blake.sandia.gov
salloc -N 1 -t 120 -p H100
module load cmake gcc/11.3.0 cuda/12.0.0

export WORKDIR=$HOME/proj/kk-2316-blake
mkdir -p "$WORKDIR"
export KOKKOS_SRC="$WORKDIR"/kokkos
export KOKKOS_BUILD="$WORKDIR"/build-kokkos
export KOKKOS_INSTALL="$WORKDIR"/install-kokkos
export KERNELS_SRC="$WORKDIR"/kernels
export KERNELS_BUILD="$WORKDIR"/build-kernels

git clone --branch develop [email protected]:kokkos/kokkos.git "$KOKKOS_SRC"
git clone --branch develop [email protected]:/kokkos/kokkos-kernels.git "$KERNELS_SRC"

cmake -S "$KOKKOS_SRC" -B "$KOKKOS_BUILD" \
  -DCMAKE_INSTALL_PREFIX="$KOKKOS_INSTALL" \
  -DCMAKE_CXX_COMPILER="$KOKKOS_SRC"/bin/nvcc_wrapper \
  -DKokkos_ENABLE_CUDA=ON \
  -DKokkos_ARCH_HOPPER90=ON \
  -DKokkos_ENABLE_TESTS=ON 
  
cmake --build "$KOKKOS_BUILD" --parallel $(nproc) --target install

cmake -S "$KERNELS_SRC" -B "$KERNELS_BUILD" \
  -DCMAKE_CXX_COMPILER="$KOKKOS_SRC"/bin/nvcc_wrapper \
  -DKokkos_ROOT="$KOKKOS_INSTALL" \
  -DKokkosKernels_ENABLE_TPL_CUSPARSE=ON \
  -DKokkosKernels_ENABLE_TESTS=ON
  
cmake --build "$KERNELS_BUILD" --parallel $(nproc) --target KokkosKernels_batched_dla_cuda

compute-sanitizer \
  "$KERNELS_BUILD"/batched/dense/unit_test/KokkosKernels_batched_dla_cuda \
  --gtest_filter='*getrs*'

[ RUN      ] Cuda.test_batched_getrs_t_double
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __global__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x7f23e316ec9c is out of bounds
=========     and is 320,498,717 bytes after the nearest allocation at 0x7f23cfe00000 of size 1,867,904 bytes
...

The text was updated successfully, but these errors were encountered:

cwpearson · 2025-01-21T17:30:41Z

@yasahi-hpc any thoughts? That access is pretty far out of bounds.

cwpearson · 2025-01-21T17:54:56Z

More info: turning on -G (debug info for device code) under any build configuration makes the test pass

yasahi-hpc · 2025-01-21T18:31:51Z

@cwpearson Hmm, I have not encountered this on my environment.
I will investigate this.

To make investigation easier, is it possible to make EXPECT_NEAR_KK a macro?
Since it is a function now, the error message does not tell which line causes a problem
(error happens inside a function).
This is somewhat related to #2478, where I encountered an error in svd_double.

[ RUN      ] serial.svd_double
/home/runner/_work/kokkos-kernels/kokkos-kernels/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4.15534e-14 vs 5.53368e-15
[  FAILED  ] serial.svd_double (8 ms)

cwpearson · 2025-01-21T19:02:40Z

I think we could make that a macro...

yasahi-hpc · 2025-01-21T19:13:15Z

If both you and @lucbv are fine, I can open a PR for that

cwpearson · 2025-01-22T15:32:16Z

Go for it @yasahi-hpc, thanks!

yasahi-hpc · 2025-01-23T19:28:38Z

@cwpearson After some investigation, I found that macrofying EXPECT_NEAR_KK is a little more complicated than I expected. I may propose another solution.

By the way, I have also confirmed that the Cuda.test_batched_getrs_t_double fails on H100 with cuda 12.0.0. Test fails at the general case for N = 1 and BlkSize = 4. Since the unit tests for getrf, trsm and laswp pass, I am not quite sure why getrs test fails.

The test passes with cuda 12.1.0.

cwpearson self-assigned this Jan 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Out of bounds memory access in `Cuda.test_batched_getrs_t_double` unit test (H100, Cuda 12.0.0) #2485

Out of bounds memory access in `Cuda.test_batched_getrs_t_double` unit test (H100, Cuda 12.0.0) #2485

cwpearson commented Jan 21, 2025

cwpearson commented Jan 21, 2025

cwpearson commented Jan 21, 2025 •

edited

Loading

yasahi-hpc commented Jan 21, 2025

cwpearson commented Jan 21, 2025

yasahi-hpc commented Jan 21, 2025

cwpearson commented Jan 22, 2025

yasahi-hpc commented Jan 23, 2025

Out of bounds memory access in Cuda.test_batched_getrs_t_double unit test (H100, Cuda 12.0.0) #2485

Out of bounds memory access in Cuda.test_batched_getrs_t_double unit test (H100, Cuda 12.0.0) #2485

Comments

cwpearson commented Jan 21, 2025

cwpearson commented Jan 21, 2025

cwpearson commented Jan 21, 2025 • edited Loading

yasahi-hpc commented Jan 21, 2025

cwpearson commented Jan 21, 2025

yasahi-hpc commented Jan 21, 2025

cwpearson commented Jan 22, 2025

yasahi-hpc commented Jan 23, 2025

Out of bounds memory access in `Cuda.test_batched_getrs_t_double` unit test (H100, Cuda 12.0.0) #2485

Out of bounds memory access in `Cuda.test_batched_getrs_t_double` unit test (H100, Cuda 12.0.0) #2485

cwpearson commented Jan 21, 2025 •

edited

Loading