cuda-memcheck: "Address ... is out of bounds" #169

Closed
hweom opened this issue Jul 27, 2022 · 1 comment · Fixed by #170
Labels
bug Something doesn't quite look right

Comments

@hweom
Contributor

hweom commented Jul 27, 2022

Describe the bug

cuda-memcheck reports a continuous stream of errors when running example-mnist-classification, such as:

========= Invalid __global__ write of size 4
=========     at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
=========     by thread (191,0,0) in block (0,0,0)
=========     Address 0x7fd319043efc is out of bounds

To Reproduce

Steps to reproduce the behaviour:

  1. cargo build
  2. cuda-memcheck target/debug/example-mnist-classification mnist linear

Expected behavior

No errors.

Please complete the following information:

  • System: Manjaro Linux
  • Version: Git commit 854043d
  • Rust: rustc 1.62.1 (e092d0b6b 2022-07-16)
  • Environment:
  • Backends (if relevant):
    • cuda:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8     5W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

Additional context

Note that running example-mnist-classification without cuda-memcheck works just fine and is able to converge. I only discovered this while working on #159, where training with CUDA does crash with CUDA_ERROR_ILLEGAL_ADDRESS when trying to copy from GPU to host. I'm not sure it's the same issue, but it seems related.

@hweom hweom added the bug Something doesn't quite look right label Jul 27, 2022
@hweom
Contributor Author

hweom commented Jul 27, 2022

Looks like the copy() function for CUDA doesn't check that the destination is large enough (here).

By adding a check like this:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // New check: source and destination must hold the same number of elements.
            assert_eq!(x.desc().size(), y.desc().size());
            // ... rest of the existing copy implementation unchanged ...
        }
    };
}
We now get a panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `300`,
 right: `10`', coaster-blas/src/frameworks/cuda/mod.rs:23:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: coaster_blas::frameworks::cuda::<impl coaster_blas::plugin::Copy<f32> for coaster::backend::Backend<coaster::frameworks::cuda::Cuda>>::copy
             at ./coaster-blas/src/frameworks/cuda/helper.rs:109:13
   5: <juice::layers::common::linear::Linear as juice::layer::ComputeParametersGradient<f32,B>>::compute_parameters_gradient
             at ./juice/src/layers/common/linear.rs:220:9

So we're trying to copy 300 floats into a tensor of size 10, and it happens in the Linear layer's compute_parameters_gradient (juice/src/layers/common/linear.rs:220 in the backtrace above).
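For context, a mismatch like this could also be surfaced as a recoverable error rather than a panic. The following is only a minimal sketch against the same macro shown above; the error variant (::coaster::error::Error::Plugin wrapping ::coaster::plugin::Error::Operation) is an assumption about coaster's API, not necessarily what the eventual fix in #170 does:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // Guard against cuBLAS writing past the end of `y`'s allocation.
            // NOTE: the error constructor below is an assumption about coaster's
            // plugin error API, used here only for illustration.
            if x.desc().size() != y.desc().size() {
                return Err(::coaster::error::Error::Plugin(
                    ::coaster::plugin::Error::Operation(
                        "copy: source and destination tensor sizes differ",
                    ),
                ));
            }
            // ... existing cuBLAS copy call unchanged ...
            Ok(())
        }
    };
}

Either way, the real bug is the caller passing mismatched tensors; the check just makes the failure visible on the Rust side instead of as an out-of-bounds write inside copy_kernel.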
