You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When both CUDA and HIP devices are present in the system, switching between them causes a crash. Specifically, after a CUDA device is used, submitting operations to a HIP device causes a CUDA_ERROR_CONTEXT_IS_DESTROYED.
A similar crash happens when using CUDA after HIP. Does not seem to happen with level_zero.
$ clang++ -g -fsycl test-iterate-crash.cpp && ONEAPI_DEVICE_SELECTOR='hip:0;cuda:0' gdb -q -ex run ./a.out [........]NVIDIA GeForce RTX 3060AMD Radeon RX 6400<CUDA>[ERROR]: UR CUDA ERROR: Value: 709 Name: CUDA_ERROR_CONTEXT_IS_DESTROYED Description: context is destroyed Function: getNextTransferStream Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:113terminate called after throwing an instance of 'sycl::_V1::exception' what(): Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)Thread 1 "a.out" received signal SIGABRT, Aborted.__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:4444 ./nptl/pthread_kill.c: No such file or directory.(gdb) bt#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:44#1 __pthread_kill_internal (signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:78#2 __GI___pthread_kill (threadid=140737352472512, signo=signo@entry=6) at ./nptl/pthread_kill.c:89#3 0x00007ffff7442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26#4 0x00007ffff74287f3 in __GI_abort () at ./stdlib/abort.c:79#5 0x00007ffff7ca2b9e in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6#6 0x00007ffff7cae20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6#7 0x00007ffff7cae277 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6#8 0x00007ffff7cae4d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6#9 0x00007ffff78a38af in void sycl::_V1::detail::Adapter::checkUrResult<(sycl::_V1::errc)1>(ur_result_t) const () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8#10 0x00007ffff7a7faea in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8#11 0x00007ffff7b16d6f in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8#12 0x0000000000402685 in main () at test-iterate-crash.cpp:14
Breaking on getNextComputeStream and urUSMDeviceAlloc shows that urUSMDeviceAlloc is called properly, first from CUDA, then from HIP, but urEnqueueUSMMemcpy is called both times from the CUDA plugin:
(gdb) b getNextComputeStream(gdb) b urUSMDeviceAlloc(gdb) rNVIDIA GeForce RTX 3060Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0(gdb) cContinuing.Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0(gdb) cContinuing.Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff76fc1c0 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0(gdb) Continuing.Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0(gdb) Continuing.AMD Radeon RX 6400Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff650c980 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0(gdb) Continuing.Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff64f6cd0 in ur_loader::urUSMDeviceAlloc(ur_context_handle_t_*, ur_device_handle_t_*, ur_usm_desc_t const*, ur_usm_pool_handle_t_*, unsigned long, void**) () from /home/aland/intel-sycl/llvm/build/install//lib/libur_loader.so.0(gdb) Continuing.Thread 1 "a.out" hit Breakpoint 4, 0x00007ffff7684b30 in urUSMDeviceAlloc () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_hip.so.0(gdb) Continuing.Thread 1 "a.out" hit Breakpoint 2, 0x00007ffff76f53d0 in ur_queue_handle_t_::getNextComputeStream(unsigned int*) [clone .localalias] () from /home/aland/intel-sycl/llvm/build/install//lib/libur_adapter_cuda.so.0(gdb) Continuing.<CUDA>[ERROR]: UR CUDA ERROR: Value: 709 Name: CUDA_ERROR_CONTEXT_IS_DESTROYED Description: context is destroyed Function: getNextComputeStream Source Location: /home/aland/intel-sycl/llvm/build/_deps/unified-runtime-src/source/adapters/cuda/queue.cpp:47terminate called after throwing an instance of 'sycl::_V1::exception' what(): Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)Thread 1 "a.out" received signal SIGABRT, Aborted.__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352472512) at ./nptl/pthread_kill.c:4444 ./nptl/pthread_kill.c: No such file or directory.
The "Source Location" points to ur_queue_handle_t_::getNextTransferStream in CUDA adapter (cuStreamCreateWithPriority call) while submitting things to the HIP device.
Environment
OS: Ubuntu 22.04
Target device and vendor: RTX3060 (CUDA 12.3), RX6400 (ROCm 6.1.1)
Describe the bug
When both CUDA and HIP devices are present in the system, switching between them causes a crash. Specifically, after a CUDA device is used, submitting operations to a HIP device causes a
CUDA_ERROR_CONTEXT_IS_DESTROYED
.A similar crash happens when using CUDA after HIP. Does not seem to happen with
level_zero
.To reproduce
Breaking on
getNextComputeStream
andurUSMDeviceAlloc
shows thaturUSMDeviceAlloc
is called properly, first from CUDA, then from HIP, buturEnqueueUSMMemcpy
is called both times from the CUDA plugin:The "Source Location" points to
ur_queue_handle_t_::getNextTransferStream
in CUDA adapter (cuStreamCreateWithPriority
call) while submitting things to the HIP device.Environment
Additional context
No response
The text was updated successfully, but these errors were encountered: