Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of bounds memory access in Cuda.test_batched_getrs_t_double unit test (H100, Cuda 12.0.0) #2485

Open
cwpearson opened this issue Jan 21, 2025 · 7 comments
Assignees

Comments

@cwpearson
Copy link
Contributor

Presumably introduced in #2483

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
$ nvidia-smi
Tue Jan 21 10:29:38 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:27:00.0 Off |                    0 |
| N/A   36C    P0              48W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:38:00.0 Off |                    0 |
| N/A   36C    P0              48W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Reproducer @ sandia:

ssh blake.sandia.gov
salloc -N 1 -t 120 -p H100
module load cmake gcc/11.3.0 cuda/12.0.0

export WORKDIR=$HOME/proj/kk-2316-blake
mkdir -p "$WORKDIR"
export KOKKOS_SRC="$WORKDIR"/kokkos
export KOKKOS_BUILD="$WORKDIR"/build-kokkos
export KOKKOS_INSTALL="$WORKDIR"/install-kokkos
export KERNELS_SRC="$WORKDIR"/kernels
export KERNELS_BUILD="$WORKDIR"/build-kernels

git clone --branch develop [email protected]:kokkos/kokkos.git "$KOKKOS_SRC"
git clone --branch develop [email protected]:/kokkos/kokkos-kernels.git "$KERNELS_SRC"

cmake -S "$KOKKOS_SRC" -B "$KOKKOS_BUILD" \
  -DCMAKE_INSTALL_PREFIX="$KOKKOS_INSTALL" \
  -DCMAKE_CXX_COMPILER="$KOKKOS_SRC"/bin/nvcc_wrapper \
  -DKokkos_ENABLE_CUDA=ON \
  -DKokkos_ARCH_HOPPER90=ON \
  -DKokkos_ENABLE_TESTS=ON 
  
cmake --build "$KOKKOS_BUILD" --parallel $(nproc) --target install

cmake -S "$KERNELS_SRC" -B "$KERNELS_BUILD" \
  -DCMAKE_CXX_COMPILER="$KOKKOS_SRC"/bin/nvcc_wrapper \
  -DKokkos_ROOT="$KOKKOS_INSTALL" \
  -DKokkosKernels_ENABLE_TPL_CUSPARSE=ON \
  -DKokkosKernels_ENABLE_TESTS=ON
  
cmake --build "$KERNELS_BUILD" --parallel $(nproc) --target KokkosKernels_batched_dla_cuda

compute-sanitizer \
  "$KERNELS_BUILD"/batched/dense/unit_test/KokkosKernels_batched_dla_cuda \
  --gtest_filter='*getrs*'
[ RUN      ] Cuda.test_batched_getrs_t_double
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.472168 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.632412 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 0.484741 vs 2.22045e-13
.../proj/kk-2316-blake/kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1.11126 vs 2.22045e-13
========= Invalid __global__ read of size 4 bytes
=========     at 0x2610 in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Test::Getrs::Functor_BatchedSerialGetrs<Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::View<double ***, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Kokkos::View<double **, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>>, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>, KokkosBlas::Algo::Level3::Unblocked>, int>::Reducer, void>, Kokkos::RangePolicy<Kokkos::Cuda, Test::Getrs::ParamTag<KokkosBlas::Trans::Transpose>>, Kokkos::Cuda>>(T1)
=========     by thread (0,1,0) in block (0,0,0)
=========     Address 0x7f23e316ec9c is out of bounds
=========     and is 320,498,717 bytes after the nearest allocation at 0x7f23cfe00000 of size 1,867,904 bytes
...
@cwpearson cwpearson self-assigned this Jan 21, 2025
@cwpearson
Copy link
Contributor Author

@yasahi-hpc any thoughts? That access is pretty far out of bounds.

@cwpearson
Copy link
Contributor Author

cwpearson commented Jan 21, 2025

More info: turning on -G (debug info for device code) under any build configuration makes the test pass

@yasahi-hpc
Copy link
Contributor

@cwpearson Hmm, I have not encountered this on my environment.
I will investigate this.

To make investigation easier, is it possible to make EXPECT_NEAR_KK a macro?
Since it is a function now, the error message does not tell which line causes a problem
(error happens inside a function).
This is somewhat related to #2478, where I encountered an error in svd_double.

[ RUN      ] serial.svd_double
/home/runner/_work/kokkos-kernels/kokkos-kernels/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:140: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4.15534e-14 vs 5.53368e-15
[  FAILED  ] serial.svd_double (8 ms)

@cwpearson
Copy link
Contributor Author

I think we could make that a macro...

@yasahi-hpc
Copy link
Contributor

If both you and @lucbv are fine, I can open a PR for that

@cwpearson
Copy link
Contributor Author

Go for it @yasahi-hpc, thanks!

@yasahi-hpc
Copy link
Contributor

@cwpearson After some investigation, I found that macrofying EXPECT_NEAR_KK is a little more complicated than I expected. I may propose another solution.

By the way, I have also confirmed that the Cuda.test_batched_getrs_t_double fails on H100 with cuda 12.0.0. Test fails at the general case for N = 1 and BlkSize = 4. Since the unit tests for getrf, trsm and laswp pass, I am not quite sure why getrs test fails.

The test passes with cuda 12.1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants