
[CUDA][HIP] Trying to use the previous backend #15632

Open

al42and opened this issue Oct 8, 2024 · 2 comments
Labels
bug (Something isn't working) · cuda (CUDA back-end) · hip (Issues related to execution on HIP backend)

Comments

al42and (Contributor) commented Oct 8, 2024

Describe the bug

When both CUDA and HIP devices are present in the system, switching between them causes a crash. Specifically, after a CUDA device has been used, submitting operations to a HIP device fails with CUDA_ERROR_CONTEXT_IS_DESTROYED.

A similar crash happens when using CUDA after HIP. The problem does not seem to occur with the Level Zero backend.

To reproduce

#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  const auto devices = sycl::device::get_devices(sycl::info::device_type::gpu);
  for (const auto &dev : devices) {

    std::cout << dev.get_info<sycl::info::device::name>() << std::endl;
    sycl::queue q{dev};

    int stack = 1;
    int *d = sycl::malloc_device<int>(3, q);

    q.memcpy(d, &stack, sizeof(int)); // crashes for the second device in the loop (test-iterate-crash.cpp:14 in the backtrace below)
    q.wait();

    sycl::free(d, q);
  }
}
$ clang++ -g -fsycl test-iterate-crash.cpp && ONEAPI_DEVICE_SELECTOR='hip:0;cuda:0' gdb -q -ex run ./a.out 
[........]
NVIDIA GeForce RTX 3060
AMD Radeon RX 6400
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           709
        Name:            CUDA_ERROR_CONTEXT_IS_DESTROYED
        Description:     context is destroyed
        Function:        getNextTransferStream
        Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:113

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Thread 1 "a.out" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352472512, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff74287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7ca2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7cae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7cae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7cae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff78a38af in void sycl::_V1::detail::Adapter::checkUrResult<(sycl::_V1::errc)1>(ur_result_t) const () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#10 0x00007ffff7a7faea in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) ()
   from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#11 0x00007ffff7b16d6f in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#12 0x0000000000402685 in main () at test-iterate-crash.cpp:14

Setting breakpoints on getNextComputeStream and urUSMDeviceAlloc shows that urUSMDeviceAlloc is dispatched correctly, first to the CUDA adapter and then to the HIP adapter, but urEnqueueUSMMemcpy is handled both times by the CUDA adapter:

(gdb) b getNextComputeStream
(gdb) b urUSMDeviceAlloc
(gdb) r
NVIDIA GeForce RTX 3060

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) c
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) c
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff76fc1c0 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.
AMD Radeon RX 6400

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff7684b30 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_hip.so.0
(gdb) 
Continuing.

Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0
(gdb) 
Continuing.
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           709
        Name:            CUDA_ERROR_CONTEXT_IS_DESTROYED
        Description:     context is destroyed
        Function:        getNextComputeStream
        Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:47

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Thread 1 "a.out" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.

The "Source Location" points to ur_queue_handle_t_::getNextTransferStream in CUDA adapter (cuStreamCreateWithPriority call) while submitting things to the HIP device.

Environment

  • OS: Ubuntu 22.04
  • Target device and vendor: RTX3060 (CUDA 12.3), RX6400 (ROCm 6.1.1)
  • DPC++ version: 1e1757b

Additional context

No response

al42and added the bug label Oct 8, 2024
0x12CC added the cuda and hip labels Oct 8, 2024
al42and (Contributor, Author) commented Nov 6, 2024

Still reproduces with Intel oneAPI 2025.0.0 and open-source 37b339e

kaanolgu commented Nov 10, 2024

Hi @al42and, I just noticed that I am having the same issue, and it seems like a "similar" problem: #16038
