cuda-memcheck: "Address ... is out of bounds" #169

Closed
hweom opened this issue Jul 27, 2022 · 1 comment · Fixed by #170
Labels
bug Something doesn't quite look right

Comments

@hweom
Contributor

hweom commented Jul 27, 2022

Describe the bug

cuda-memcheck reports a continuous stream of errors when running example-mnist-classification, such as:

========= Invalid __global__ write of size 4
=========     at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
=========     by thread (191,0,0) in block (0,0,0)
=========     Address 0x7fd319043efc is out of bounds

To Reproduce

Steps to reproduce the behaviour:

  1. cargo build
  2. cuda-memcheck target/debug/example-mnist-classification mnist linear

Expected behavior

No errors.

Please complete the following information:

  • System: Manjaro Linux
  • Version: Git commit 854043d
  • Rust: rustc 1.62.1 (e092d0b6b 2022-07-16)
  • Environment:
  • Backends (if relevant):
    • cuda:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8     5W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

Additional context

Note that running example-mnist-classification without cuda-memcheck works just fine and is able to converge. I only discovered this while working on #159, where training with CUDA does crash with CUDA_ERROR_ILLEGAL_ADDRESS when trying to copy from GPU to host. I'm not sure it's the same issue, but it seems related.

@hweom hweom added the bug Something doesn't quite look right label Jul 27, 2022
@hweom
Contributor Author

hweom commented Jul 27, 2022

Looks like the copy() function for CUDA doesn't check that the destination is large enough (here).

By adding a check like this:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // New check: source and destination must hold the same number of elements.
            assert_eq!(x.desc().size(), y.desc().size());
            // ... rest of the existing copy implementation unchanged ...
        }
    };
}
We now get a panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `300`,
 right: `10`', coaster-blas/src/frameworks/cuda/mod.rs:23:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: coaster_blas::frameworks::cuda::<impl coaster_blas::plugin::Copy<f32> for coaster::backend::Backend<coaster::frameworks::cuda::Cuda>>::copy
             at ./coaster-blas/src/frameworks/cuda/helper.rs:109:13
   5: <juice::layers::common::linear::Linear as juice::layer::ComputeParametersGradient<f32,B>>::compute_parameters_gradient
             at ./juice/src/layers/common/linear.rs:220:9

So we're trying to copy 300 floats into a tensor of size 10, and it happens in the Linear layer's compute_parameters_gradient (juice/src/layers/common/linear.rs:220 in the backtrace above).
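For context, a mismatch like this could also be surfaced as a recoverable error rather than a panic. The following is only a minimal sketch against the same macro shown above; the error variant (::coaster::error::Error::Plugin wrapping ::coaster::plugin::Error::Operation) is an assumption about coaster's API, not necessarily what the eventual fix in #170 does:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // Guard against cuBLAS writing past the end of `y`'s allocation.
            // NOTE: the error constructor below is an assumption about coaster's
            // plugin error API, used here only for illustration.
            if x.desc().size() != y.desc().size() {
                return Err(::coaster::error::Error::Plugin(
                    ::coaster::plugin::Error::Operation(
                        "copy: source and destination tensor sizes differ",
                    ),
                ));
            }
            // ... existing cuBLAS copy call unchanged ...
            Ok(())
        }
    };
}

Either way, the real bug is the caller passing mismatched tensors; the check just makes the failure visible on the Rust side instead of as an out-of-bounds write inside copy_kernel.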
