
[CUDA][HIP] Trying to use the previous backend #15632

Open

al42and opened this issue Oct 8, 2024 · 2 comments
Labels
bug (Something isn't working) · cuda (CUDA back-end) · hip (Issues related to execution on HIP backend)

Comments

al42and (Contributor) commented Oct 8, 2024

Describe the bug

When both CUDA and HIP devices are present in the system, switching between them causes a crash. Specifically, after a CUDA device has been used, submitting operations to a HIP device fails with CUDA_ERROR_CONTEXT_IS_DESTROYED.

A similar crash happens when using CUDA after HIP. The problem does not seem to occur with the Level Zero backend.

To reproduce

#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  const auto devices = sycl::device::get_devices(sycl::info::device_type::gpu);
  for (const auto &dev : devices) {

    std::cout << dev.get_info<sycl::info::device::name>() << std::endl;
    sycl::queue q{dev};

    int stack = 1;
    int *d = sycl::malloc_device<int>(3, q);

    q.memcpy(d, &stack, sizeof(int)); // crashes for the second device in the loop (test-iterate-crash.cpp:14 in the backtrace below)
    q.wait();

    sycl::free(d, q);
  }
}
$ clang++ -g -fsycl test-iterate-crash.cpp && ONEAPI_DEVICE_SELECTOR='hip:0;cuda:0' gdb -q -ex run ./a.out 
[........]
NVIDIA GeForce RTX 3060
AMD Radeon RX 6400
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           709
        Name:            CUDA_ERROR_CONTEXT_IS_DESTROYED
        Description:     context is destroyed
        Function:        getNextTransferStream
        Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:113

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Thread 1 "a.out" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352472512, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff74287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7ca2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7cae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7cae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7cae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff78a38af in void sycl::_V1::detail::Adapter::checkUrResult<(sycl::_V1::errc)1>(ur_result_t) const () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#10 0x00007ffff7a7faea in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) ()
   from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#11 0x00007ffff7b16d6f in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#12 0x0000000000402685 in main () at test-iterate-crash.cpp:14

Setting breakpoints on getNextComputeStream and urUSMDeviceAlloc shows that urUSMDeviceAlloc is dispatched correctly, first to the CUDA adapter and then to the HIP adapter, but urEnqueueUSMMemcpy is handled both times by the CUDA adapter:

(gdb) b getNextComputeStream
(gdb) b urUSMDeviceAlloc
(gdb) r
NVIDIA GeForce RTX 3060

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) c
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) c
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff76fc1c0 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.
AMD Radeon RX 6400

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff7684b30 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_hip.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           709
        Name:            CUDA_ERROR_CONTEXT_IS_DESTROYED
        Description:     context is destroyed
        Function:        getNextComputeStream
        Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:47

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Thread 1 "a.out" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.

The "Source Location" points to ur_queue_handle_t_::getNextTransferStream in CUDA adapter (cuStreamCreateWithPriority call) while submitting things to the HIP device.

Environment

  • OS: Ubuntu 22.04
  • Target device and vendor: RTX3060 (CUDA 12.3), RX6400 (ROCm 6.1.1)
  • DPC++ version: 1e1757b

Additional context

No response

al42and added the bug label Oct 8, 2024
0x12CC added the cuda and hip labels Oct 8, 2024
al42and (Contributor, Author) commented Nov 6, 2024

Still reproduces with Intel oneAPI 2025.0.0 and open-source 37b339e

kaanolgu commented Nov 10, 2024

Hi @al42and, I just noticed that I am having the same issue, and it seems like a "similar" problem: #16038
