Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyOpenCL target: Overflow large argument counts into SVM struct #642

Merged
merged 6 commits into from
Sep 12, 2022

Conversation

matthiasdiener
Copy link
Collaborator

@matthiasdiener matthiasdiener commented Jul 1, 2022

@inducer inducer force-pushed the svm-args branch 2 times, most recently from ae9eb7c to 054d05a Compare July 2, 2022 05:32
@inducer inducer changed the title convert entrypoint args to struct PyOpenCL target: Overflow large argument counts into SVM struct Jul 2, 2022
@inducer
Copy link
Owner

inducer commented Jul 2, 2022

With

the following passes for me:

LOOPY_NO_CACHE=1 pycl test_target.py 'test_passing_bajillions_of_svm_args(cl._csc)'    

Let me know if you can reproduce that.

@inducer inducer force-pushed the svm-args branch 3 times, most recently from b643561 to a9b2746 Compare July 2, 2022 19:10
@matthiasdiener
Copy link
Collaborator Author

matthiasdiener commented Jul 6, 2022

I confirmed that this PR works on POCL-pthreads and Nvidia CL when also using inducer/pyopencl#452 (and pocl/pocl#1069 in the POCL-pthreads case).

With POCL-cuda and pocl/pocl#1067 in addition to pocl/pocl#1069 it currently fails when trying to access an array on the host after executing a GPU kernel.

The following code reproduces this:

import numpy as np
import pyopencl as cl
import pyopencl.array as cla

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

alloc = cl.tools.SVMAllocator(ctx, cl.svm_mem_flags.READ_WRITE, queue=queue)
ary = cla.zeros(queue, 20, np.float64, allocator=alloc)
ary.fill(17)  # runs as GPU kernel
ary.copy() # segfaults here

The full output log is (with OCL_ICD_DEBUG=7 and POCL_DEBUG=all):

$ python svm2.py
ocl-icd(ocl_icd_loader.c:776): __initClIcd: Reading icd list from '/home/mdiener/Work/pocl/install/etc/OpenCL/vendors/'
ocl-icd(ocl_icd_loader.c:234): _find_num_icds: return: 1/0x1
ocl-icd(ocl_icd_loader.c:265): _open_driver: Considering file '/home/mdiener/Work/pocl/install/etc/OpenCL/vendors//pocl.icd'
ocl-icd(ocl_icd_loader.c:239): _load_icd: Loading ICD '/home/mdiener/Work/pocl/install/lib/libpocl.so.2.9.0'
ocl-icd(ocl_icd_loader.c:243): _load_icd: ICD[0] loaded
ocl-icd(ocl_icd_loader.c:297): _open_driver: return: 1/0x1
ocl-icd(ocl_icd_loader.c:320): _open_drivers: return: 1/0x1
ocl-icd(ocl_icd_loader.c:477): _find_and_check_platforms: Checking ICD 0/1
ocl-icd(ocl_icd_loader.c:325): _get_function_addr: Looking for function clGetExtensionFunctionAddress
ocl-icd(ocl_icd_loader.c:343): _get_function_addr: return: 140254412730417/0x7f8f8672f031
ocl-icd(ocl_icd_loader.c:325): _get_function_addr: Looking for function clIcdGetPlatformIDsKHR
ocl-icd(ocl_icd_loader.c:328): _get_function_addr: Missing global symbol 'clIcdGetPlatformIDsKHR' in ICD, should be skipped
ocl-icd(ocl_icd_loader.c:343): _get_function_addr: return: 140254412732259/0x7f8f8672f763
ocl-icd(ocl_icd_loader.c:325): _get_function_addr: Looking for function clGetPlatformInfo
ocl-icd(ocl_icd_loader.c:328): _get_function_addr: Missing global symbol 'clGetPlatformInfo' in ICD, should be skipped
ocl-icd(ocl_icd_loader.c:343): _get_function_addr: return: 140254412570414/0x7f8f86707f2e
ocl-icd(ocl_icd_loader.c:526): _find_and_check_platforms: Try to load 1 platforms
ocl-icd(ocl_icd_loader.c:348): _allocate_platforms: Requesting allocation for 1 platforms
ocl-icd(ocl_icd_loader.c:358): _allocate_platforms: return: 1/0x1
ocl-icd(ocl_icd_loader.c:533): _find_and_check_platforms: Checking platform 0
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: cl_khr_icd cl_pocl_content_size
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: POCL
ocl-icd(ocl_icd_loader.c:603): _find_and_check_platforms: Extension suffix: POCL
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: FULL_PROFILE
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: OpenCL 3.0 PoCL 3.1-pre cuda-svm-0-g8e6dd829  Linux, Debug+Asserts, RELOC, SPIR, LLVM 14.0.6, SLEEF, CUDA, POCL_DEBUG
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: Portable Computing Language
ocl-icd(ocl_icd_loader.c:384): _malloc_clGetPlatformInfo: return: The pocl project
ocl-icd(ocl_icd_loader.c:431): _sort_platforms: Nb platefroms: 1
ocl-icd(ocl_icd_loader.c:824): __initClIcd: 1 valid vendor(s)!
ocl-icd(ocl_icd_loader.c:1060): clGetPlatformIDs: Entering
ocl-icd(ocl_icd_loader_gen.c:1683): clGetDeviceIDs: Entering
** Final POCL_DEBUG flags: FFFFFFFFFFFFFFFF
[2022-07-06 04:18:58.406713745]POCL: in fn pocl_install_sigfpe_handler at line 229:
  |   GENERAL |  Installing SIGFPE handler...
[2022-07-06 04:18:58.522237845]POCL: in fn pocl_cuda_init at line 397:
  |   GENERAL |  [CUDA] GPU architecture = sm_35
[2022-07-06 04:18:58.522300352]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/nvvm/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522331286]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/nvvm/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522343606]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/nvidia-cuda-toolkit/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522360890]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/nvidia-cuda-toolkit/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522373642]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522387107]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522400285]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/nvvm/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522415089]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/nvvm/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522427965]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/nvidia-cuda-toolkit/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522440770]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/nvidia-cuda-toolkit/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522453608]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522466226]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/cuda/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522479006]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/nvvm/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522491669]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/nvvm/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522503989]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/nvidia-cuda-toolkit/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522517051]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/nvidia-cuda-toolkit/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522529598]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/local/lib/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522542592]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/local/lib/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522555310]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/lib/nvvm/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522568743]POCL: in fn findLibDevice at line 581:
  |      CUDA | looking for libdevice at '/usr/lib/nvvm/libdevice/libdevice.compute_35.10.bc'
[2022-07-06 04:18:58.522580592]POCL: in fn findLibDevice at line 569:
  |      CUDA | looking for libdevice at '/usr/lib/nvidia-cuda-toolkit/libdevice/libdevice.10.bc'
[2022-07-06 04:18:58.522596110]POCL: in fn findLibDevice at line 572:
  |      CUDA | found libdevice at '/usr/lib/nvidia-cuda-toolkit/libdevice/libdevice.10.bc'
ocl-icd(ocl_icd_loader_gen.c:1691): clGetDeviceIDs: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1683): clGetDeviceIDs: Entering
ocl-icd(ocl_icd_loader_gen.c:1691): clGetDeviceIDs: return: 0/0x0
ocl-icd(ocl_icd_loader.c:1140): clCreateContext: Entering
[2022-07-06 04:18:58.630074880]POCL: in fn POclCreateCommandQueue at line 47:
  |   GENERAL |  Create Command queue on device 1
[2022-07-06 04:18:58.630114638]POCL: in fn void pocl_llvm_create_context(cl_context) at line 379:
  |      LLVM |  creating LLVM context
ocl-icd(ocl_icd_loader.c:1149): clCreateContext: return: 94104096093728/0x559651383620
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:3259): clCreateCommandQueueWithProperties: Entering
[2022-07-06 04:18:58.630255155]POCL: in fn POclCreateCommandQueue at line 47:
  |   GENERAL |  Create Command queue on device 1
[2022-07-06 04:18:58.630266146]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 2
ocl-icd(ocl_icd_loader_gen.c:3268): clCreateCommandQueueWithProperties: return: 94104096102768/0x559651385970
ocl-icd(ocl_icd_loader_gen.c:1791): clGetCommandQueueInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1797): clGetCommandQueueInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1711): clRetainContext: Entering
[2022-07-06 04:18:58.630350961]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 3
ocl-icd(ocl_icd_loader_gen.c:1717): clRetainContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1765): clRetainCommandQueue: Entering
[2022-07-06 04:18:58.630446057]POCL: in fn POclRetainCommandQueue at line 33:
  | REFCOUNTS |  Retain Command Queue 0x559651385970  : 2
ocl-icd(ocl_icd_loader_gen.c:1771): clRetainCommandQueue: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1791): clGetCommandQueueInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1797): clGetCommandQueueInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:3310): clSVMAlloc: Entering
SVM cuMemAllocManaged 160
before write 0 0x4204060000
after write 42.000000
[2022-07-06 04:18:58.630824524]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 4
[2022-07-06 04:18:58.630836034]POCL: in fn POclSVMAlloc at line 114:
  |    MEMORY |  Allocated SVM: PTR 0x4204060000, SIZE 160, FLAGS 1
ocl-icd(ocl_icd_loader_gen.c:3316): clSVMAlloc: return: 283535343616/0x4204060000
ocl-icd(ocl_icd_loader_gen.c:1791): clGetCommandQueueInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1797): clGetCommandQueueInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:3379): clEnqueueSVMMemFill: Entering
[2022-07-06 04:18:58.631313572]POCL: in fn POclRetainCommandQueue at line 33:
  | REFCOUNTS |  Retain Command Queue 0x559651385970  : 3
[2022-07-06 04:18:58.631323842]POCL: in fn pocl_create_event at line 514:
  |    EVENTS |  Created event 0x55965138ddb0 / ID 1 / Command svm_memfill
[2022-07-06 04:18:58.631333776]POCL: in fn pocl_create_command_struct at line 630:
  |    EVENTS |  event pointer provided
[2022-07-06 04:18:58.631342637]POCL: in fn pocl_create_command_struct at line 650:
  |    EVENTS |  Created command struct: CMD 0x55965138dcd0 (event 1 / 0x55965138ddb0, type: svm_memfill)
[2022-07-06 04:18:58.631353096]POCL: in fn pocl_command_enqueue at line 1061:
  |    EVENTS |  In-order Q; adding event syncs
[2022-07-06 04:18:58.631362451]POCL: in fn pocl_command_enqueue at line 1105:
  |    EVENTS |  Pushed Event 1 to CQ 5.
[2022-07-06 04:18:58.631372020]POCL: in fn pocl_update_event_queued at line 1922:
  |    EVENTS |  Event queued: 1
[2022-07-06 04:18:58.631394026]POCL: in fn pocl_update_event_submitted at line 1942:
  |    EVENTS |  Event submitted: 1
BEFORE MEMFILL 160 0x4204060000
AFTER MEMFILL
ocl-icd(ocl_icd_loader_gen.c:3385): clEnqueueSVMMemFill: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2005): clCreateProgramWithSource: Entering
[2022-07-06 04:18:58.632158085]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 5
ocl-icd(ocl_icd_loader_gen.c:2014): clCreateProgramWithSource: return: 94104096146192/0x559651390310
ocl-icd(ocl_icd_loader_gen.c:2066): clBuildProgram: Entering
[2022-07-06 04:18:58.632208081]POCL: in fn compile_and_link_program at line 691:
  |      LLVM |  building program with options -I /shared/home/mdiener/Work/pyopencl/pyopencl/cl
[2022-07-06 04:18:58.632218813]POCL: in fn compile_and_link_program at line 713:
  |      LLVM |  building program for 0 devs with options -I /shared/home/mdiener/Work/pyopencl/pyopencl/cl
[2022-07-06 04:18:58.632228799]POCL: in fn compile_and_link_program at line 717:
  |      LLVM |     BUILDING for device: Tesla K40c
[2022-07-06 04:18:58.632238807]POCL: in fn pocl_driver_build_source at line 511:
  |      LLVM |  building from sources for device 0
[2022-07-06 04:18:58.632438048]POCL: in fn int pocl_llvm_build_program(cl_program, unsigned int, cl_uint, _cl_program* const*, const char**, int) at line 382:
  |      LLVM |  all build options: -mllvm --nvptx-short-ptr -Dcl_khr_int64 -DPOCL_DEVICE_ADDRESS_BITS=64 -D__USE_CLANG_OPENCL_C_H -xcl -Dinline= -I. -cl-kernel-arg-info -I /shared/home/mdiener/Work/pyopencl/pyopencl/cl -D__ENDIAN_LITTLE__=1 -DCL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE=0 -D__OPENCL_VERSION__=120 -cl-std=CL1.2 -D__OPENCL_C_VERSION__=120 -Dcl_khr_byte_addressable_store=1 -Dcl_khr_global_int32_base_atomics=1 -Dcl_khr_global_int32_extended_atomics=1 -Dcl_khr_local_int32_base_atomics=1 -Dcl_khr_local_int32_extended_atomics=1 -Dcl_khr_fp64=1 -Dcl_khr_int64_base_atomics=1 -Dcl_khr_int64_extended_atomics=1 -Dcl_nv_device_attribute_query=1 -Dcl_khr_spir=1 -cl-ext=-all,+cl_khr_byte_addressable_store,+cl_khr_global_int32_base_atomics,+cl_khr_global_int32_extended_atomics,+cl_khr_local_int32_base_atomics,+cl_khr_local_int32_extended_atomics,+cl_khr_fp64,+cl_khr_int64_base_atomics,+cl_khr_int64_extended_atomics,+cl_nv_device_attribute_query,+cl_khr_spir -fno-builtin -triple=nvptx64 -target-cpu sm_35
ocl-icd(ocl_icd_loader_gen.c:2072): clBuildProgram: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2081): clGetProgramInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2087): clGetProgramInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2081): clGetProgramInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2087): clGetProgramInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2097): clGetProgramBuildInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2103): clGetProgramBuildInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2097): clGetProgramBuildInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2103): clGetProgramBuildInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2110): clCreateKernel: Entering
ocl-icd(ocl_icd_loader_gen.c:2119): clCreateKernel: return: 94104096155376/0x5596513926f0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1711): clRetainContext: Entering
[2022-07-06 04:18:58.697513230]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 6
ocl-icd(ocl_icd_loader_gen.c:1717): clRetainContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1722): clReleaseContext: Entering
[2022-07-06 04:18:58.697573667]POCL: in fn POclReleaseContext at line 47:
  | REFCOUNTS |  Release Context
ocl-icd(ocl_icd_loader_gen.c:1728): clReleaseContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1711): clRetainContext: Entering
[2022-07-06 04:18:58.698408613]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 6
ocl-icd(ocl_icd_loader_gen.c:1717): clRetainContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1722): clReleaseContext: Entering
[2022-07-06 04:18:58.698444839]POCL: in fn POclReleaseContext at line 47:
  | REFCOUNTS |  Release Context
ocl-icd(ocl_icd_loader_gen.c:1728): clReleaseContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1711): clRetainContext: Entering
[2022-07-06 04:18:58.698467371]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 6
ocl-icd(ocl_icd_loader_gen.c:1717): clRetainContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1722): clReleaseContext: Entering
[2022-07-06 04:18:58.698561830]POCL: in fn POclReleaseContext at line 47:
  | REFCOUNTS |  Release Context
ocl-icd(ocl_icd_loader_gen.c:1728): clReleaseContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2178): clGetKernelInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2184): clGetKernelInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1711): clRetainContext: Entering
[2022-07-06 04:18:58.698612903]POCL: in fn POclRetainContext at line 32:
  | REFCOUNTS |  Retain Context 0x559651383620  : 6
ocl-icd(ocl_icd_loader_gen.c:1717): clRetainContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1737): clGetContextInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1743): clGetContextInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1722): clReleaseContext: Entering
[2022-07-06 04:18:58.698642850]POCL: in fn POclReleaseContext at line 47:
  | REFCOUNTS |  Release Context
ocl-icd(ocl_icd_loader_gen.c:1728): clReleaseContext: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1666): clGetPlatformInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1674): clGetPlatformInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2050): clReleaseProgram: Entering
[2022-07-06 04:18:58.699081032]POCL: in fn POclReleaseProgram at line 50:
  | REFCOUNTS |  Release program 0x559651390310, new refcount: 1, kernel #: 1
ocl-icd(ocl_icd_loader_gen.c:2056): clReleaseProgram: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2194): clGetKernelWorkGroupInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:2200): clGetKernelWorkGroupInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1700): clGetDeviceInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1706): clGetDeviceInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:3441): clSetKernelArgSVMPointer: Entering
[2022-07-06 04:18:58.699270966]POCL: in fn POclSetKernelArgSVMPointer at line 43:
  |   GENERAL |  Setting kernel ARG 0 to SVM 0x4204060000
ocl-icd(ocl_icd_loader_gen.c:3447): clSetKernelArgSVMPointer: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2163): clSetKernelArg: Entering
[2022-07-06 04:18:58.699293930]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel            fill || SetArg idx   1 ||     long || Local 0 || Size      8 || Value 0x7ffc31e941a8 || Pointer (nil) || *(uint32*)Value:        0 || *(uint64*)Value:        0 ||
Hex Value:  00000000 00000000
ocl-icd(ocl_icd_loader_gen.c:2169): clSetKernelArg: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2163): clSetKernelArg: Entering
[2022-07-06 04:18:58.699312532]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel            fill || SetArg idx   2 ||   double || Local 0 || Size      8 || Value 0x7ffc31e941a8 || Pointer (nil) || *(uint32*)Value:        0 || *(uint64*)Value: 4625478292286210048 ||
Hex Value:  00000000 00003140
ocl-icd(ocl_icd_loader_gen.c:2169): clSetKernelArg: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2163): clSetKernelArg: Entering
[2022-07-06 04:18:58.699330226]POCL: in fn POclSetKernelArg at line 107:
  |   GENERAL |  Kernel            fill || SetArg idx   3 ||     long || Local 0 || Size      8 || Value 0x7ffc31e941a8 || Pointer (nil) || *(uint32*)Value:        0 || *(uint64*)Value:       20 ||
Hex Value:  14000000 00000000
ocl-icd(ocl_icd_loader_gen.c:2169): clSetKernelArg: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:2507): clEnqueueNDRangeKernel: Entering
[2022-07-06 04:18:58.699356022]POCL: in fn POclEnqueueNDRangeKernel at line 221:
  |   GENERAL |  Queueing kernel fill with local size 32 x 1 x 1 group sizes 1 x 1 x 1...
[2022-07-06 04:18:58.699367414]POCL: in fn POclRetainCommandQueue at line 33:
  | REFCOUNTS |  Retain Command Queue 0x559651385970  : 4
[2022-07-06 04:18:58.699376473]POCL: in fn pocl_create_event at line 514:
  |    EVENTS |  Created event 0x55965144f190 / ID 2 / Command ndrange_kernel
[2022-07-06 04:18:58.699384689]POCL: in fn pocl_create_command_struct at line 630:
  |    EVENTS |  event pointer provided
[2022-07-06 04:18:58.699394485]POCL: in fn pocl_create_event_sync at line 530:
  |    EVENTS |  create event sync: waiting 2 , notifier 1
[2022-07-06 04:18:58.699403512]POCL: in fn pocl_create_command_struct at line 650:
  |    EVENTS |  Created command struct: CMD 0x559650ddf460 (event 2 / 0x55965144f190, type: ndrange_kernel)
[2022-07-06 04:18:58.699413923]POCL: in fn POclRetainKernel at line 33:
  | REFCOUNTS |  Retain Kernel 0x5596513926f0  : 2
[2022-07-06 04:18:58.699422964]POCL: in fn pocl_command_enqueue at line 1061:
  |    EVENTS |  In-order Q; adding event syncs
[2022-07-06 04:18:58.699429314]POCL: in fn pocl_create_event_sync at line 530:
  |    EVENTS |  create event sync: waiting 2 , notifier 1
[2022-07-06 04:18:58.699438005]POCL: in fn pocl_create_event_sync at line 543:
  |    EVENTS |  Skipping event sync creation
[2022-07-06 04:18:58.699446350]POCL: in fn pocl_create_event_sync at line 530:
  |    EVENTS |  create event sync: waiting 2 , notifier 1
[2022-07-06 04:18:58.699454328]POCL: in fn pocl_create_event_sync at line 543:
  |    EVENTS |  Skipping event sync creation
[2022-07-06 04:18:58.699462708]POCL: in fn pocl_command_enqueue at line 1105:
  |    EVENTS |  Pushed Event 2 to CQ 5.
[2022-07-06 04:18:58.699470964]POCL: in fn pocl_update_event_queued at line 1922:
  |    EVENTS |  Event queued: 2
[2022-07-06 04:18:58.699489571]POCL: in fn pocl_update_event_submitted at line 1942:
  |    EVENTS |  Event submitted: 2
ocl-icd(ocl_icd_loader_gen.c:2513): clEnqueueNDRangeKernel: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1765): clRetainCommandQueue: Entering
[2022-07-06 04:18:58.700081115]POCL: in fn POclRetainCommandQueue at line 33:
  | REFCOUNTS |  Retain Command Queue 0x559651385970  : 5
ocl-icd(ocl_icd_loader_gen.c:1771): clRetainCommandQueue: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:1791): clGetCommandQueueInfo: Entering
ocl-icd(ocl_icd_loader_gen.c:1797): clGetCommandQueueInfo: return: 0/0x0
ocl-icd(ocl_icd_loader_gen.c:3310): clSVMAlloc: Entering
SVM cuMemAllocManaged 160
before write 0 0x4204061000
Bus error (core dumped)

i.e., it seems to successfully allocate memory with clSVMAlloc/cuMemAllocManaged, but the resulting memory appears to be inaccessible from the host.

@inducer
Copy link
Owner

inducer commented Jul 6, 2022

I obviously can't guarantee that that's what at issue here, but I suspect you'll need pocl/pocl#1069 (or another fix for the same issue) in order to allow this to work. I'm actually sort of surprised pocl-pthreads worked.

If that doesn't help, a backtrace and potentially the first meaningful thing flagged by valgrind would be of use.

@matthiasdiener
Copy link
Collaborator Author

matthiasdiener commented Jul 6, 2022

I obviously can't guarantee that that's what at issue here, but I suspect you'll need pocl/pocl#1069 (or another fix for the same issue) in order to allow this to work. I'm actually sort of surprised pocl-pthreads worked.

I'm sorry, I should have been clearer. I used inducer/pyopencl#452 and pocl/pocl#1069 (for POCL-pthreads and POCL-cuda) for all tests. Note that my failing example above doesn't even use loopy (the loopy test in this PR just exposed it), so this PR might be the wrong location to track this issue.

If that doesn't help, a backtrace and potentially the first meaningful thing flagged by valgrind would be of use.

Here is a part of the backtrace at the point of the crash. It doesn't appear to be too useful though.

SVM cuMemAllocManaged 160
before write 0 0x4204061000
--Type <RET> for more, q to quit, c to continue without paging--q

Thread 1 "python" received signal SIGBUS, Bus error.
0x00007fff8510e39c in pocl_cuda_svm_alloc (dev=0x555555f27020, flags=1, size=160)
    at /home/mdiener/Work/pocl/lib/CL/devices/cuda/pocl-cuda.c:1895
1895	  ((double*)dptr)[0] = 42.0;
(gdb) bt
#0  0x00007fff8510e39c in pocl_cuda_svm_alloc (dev=0x555555f27020, flags=1, size=160)
    at /home/mdiener/Work/pocl/lib/CL/devices/cuda/pocl-cuda.c:1895
#1  0x00007fff91c7cfb6 in POclSVMAlloc (context=0x5555564b86d0, flags=1, size=160, alignment=128)
    at /home/mdiener/Work/pocl/lib/CL/clSVMAlloc.c:98
#2  0x00007fff9220638f in clSVMAlloc () from /shared/home/mdiener/Work/emirge/miniforge3/envs/poclbuild/lib/libOpenCL.so.1
#3  0x00007fff922bb337 in pybind11::cpp_function::initialize<pybind11::detail::initimpl::constructor<std::shared_ptr<pyopencl::context>, unsigned long, unsigned int, unsigned long, pyopencl::command_queue const*>::execute<pybind11::class_<pyopencl::svm_allocation>, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, 0>(pybind11::class_<pyopencl::svm_allocation>&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::value_and_holder&, std::shared_ptr<pyopencl::context>, unsigned long, unsigned int, unsigned long, pyopencl::command_queue const*)#1}, void, pybind11::detail::value_and_holder&, std::shared_ptr<pyopencl::context>, unsigned long, unsigned int, unsigned long, pyopencl::command_queue const*, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::detail::is_new_style_constructor, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v>(pybind11::class_<pyopencl::svm_allocation>&&, void (*)(pybind11::detail::value_and_holder&, std::shared_ptr<pyopencl::context>, unsigned long, unsigned int, unsigned long, pyopencl::command_queue const*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::detail::is_new_style_constructor const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /shared/home/mdiener/Work/pyopencl/pyopencl/_cl.cpython-310-x86_64-linux-gnu.so
#4  0x00007fff922494a6 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /shared/home/mdiener/Work/pyopencl/pyopencl/_cl.cpython-310-x86_64-linux-gnu.so
#5  0x000055555569850c in cfunction_call (func=0x7fff9240e1b0, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.10.5/Objects/methodobject.c:543
#6  0x00005555556a6db9 in _PyObject_Call (kwargs=<optimized out>, args=0x7fff91e2cdc0, callable=0x7fff9240e1b0, tstate=0x55555591be10)
    at /usr/local/src/conda/python-3.10.5/Objects/call.c:305
[...]

I modified pocl/pocl#1067 such that it tries to write to the just allocated buffer after a successful cuMemAllocManaged (this just triggers the issue a bit earlier). What happens is that cuMemAllocManaged appears to succeed, but the returned memory can not be read or written on the host. This only appears after a kernel has run on the GPU. As far as I can see, no SVM allocated memory is freed at all for the example code I used.

This is on koelsch with a Tesla K40c.

@inducer
Copy link
Owner

inducer commented Jul 7, 2022

With CU_MEM_ATTACH_GLOBAL, I don't think you have a guarantee that the memory should be accessible from the host. Also, since you seem to attribute the crash in the sample code from #642 (comment) to host-side access, could you explain where you think that host-side access is happening? (I don't see it. A backtrace would help.)

@inducer
Copy link
Owner

inducer commented Jul 7, 2022

Btw, I agree that this discussion does not have much to do with Loopy. Maybe let's continue the discussion here: inducer/pyopencl#452.

@matthiasdiener
Copy link
Collaborator Author

(Continuing the discussion here for a bit since I got the loopy test in this PR running with the change outlined below)

With CU_MEM_ATTACH_GLOBAL, I don't think you have a guarantee that the memory should be accessible from the host.

Hmm, that is interesting. Based on your comment, I tried CU_MEM_ATTACH_HOST, and with that change, my test case above and the test in this PR run successfully on POCL-cuda. I'm not sure I understand why though, based on the documentation:

flags specifies the default stream association for this allocation. flags must be one of CU_MEM_ATTACH_GLOBAL or CU_MEM_ATTACH_HOST. If CU_MEM_ATTACH_GLOBAL is specified, then this memory is accessible from any stream on any device. If CU_MEM_ATTACH_HOST is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS; an explicit call to cuStreamAttachMemAsync will be required to enable access on such devices.

The device I'm running on (K40c) returns 0 for CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS (but 1 for CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY).

I found another fix (workaround?) in pocl/pocl@03ffc71 which just uses CUDA functions for the memfill operation. With that fix, my simple test and the test in this PR also work.

Also, since you seem to attribute the crash in the sample code from #642 (comment) to host-side access, could you explain where you think that host-side access is happening? (I don't see it. A backtrace would help.)

Here is a backtrace with my debug cruft removed:

--Type <RET> for more, q to quit, c to continue without paging--

Thread 1 "python" received signal SIGBUS, Bus error.
0x00007fff91d71bab in pocl_fill_aligned_buf_with_pattern (ptr=0x4204061000, offset=0, size=160, pattern=0x55555650d600, pattern_size=1)
    at /home/mdiener/Work/pocl/lib/CL/pocl_util.c:2347
2347	          p[i] = *(uint8_t *)pattern;
(gdb) bt
#0  0x00007fff91d71bab in pocl_fill_aligned_buf_with_pattern (ptr=0x4204061000, offset=0, size=160, pattern=0x55555650d600, pattern_size=1)
    at /home/mdiener/Work/pocl/lib/CL/pocl_util.c:2347
#1  0x00007fff91d91088 in pocl_driver_memfill (data=0x555555ef2c80, dst_mem_id=0x7fffffffd390, dst_buf=0x0, size=160, offset=0,
    pattern=0x55555650d600, pattern_size=1) at /home/mdiener/Work/pocl/lib/CL/devices/common_driver.c:307
#2  0x00007fff91d91690 in pocl_driver_svm_fill (dev=0x555555f08f00, svm_ptr=0x4204061000, size=160, pattern=0x55555650d600, pattern_size=1)
    at /home/mdiener/Work/pocl/lib/CL/devices/common_driver.c:434
#3  0x00007fff8520c96c in pocl_cuda_submit_node (node=0x5555564c2a00, cq=0x5555564ad6e0, locked=1)
    at /home/mdiener/Work/pocl/lib/CL/devices/cuda/pocl-cuda.c:1469
#4  0x00007fff8520d02c in pocl_cuda_submit (node=0x5555564c2a00, cq=0x5555564ad6e0)
    at /home/mdiener/Work/pocl/lib/CL/devices/cuda/pocl-cuda.c:1557
#5  0x00007fff91d6cf47 in pocl_command_enqueue (command_queue=0x5555564ad6e0, node=0x5555564c2a00)
    at /home/mdiener/Work/pocl/lib/CL/pocl_util.c:1114
#6  0x00007fff91d80469 in POclEnqueueSVMMemFill (command_queue=0x5555564ad6e0, svm_ptr=0x4204061000, pattern=0x7fff923f7b20,
    pattern_size=1, size=160, num_events_in_wait_list=0, event_wait_list=0x0, event=0x7fffffffd660)
    at /home/mdiener/Work/pocl/lib/CL/clEnqueueSVMMemFill.c:89
#7  0x00007fff922068cc in clEnqueueSVMMemFill () from /shared/home/mdiener/Work/emirge/miniforge3/envs/poclbuild/lib/libOpenCL.so.1
#8  0x00007fff922aa6d2 in pyopencl::enqueue_svm_memfill(pyopencl::command_queue&, pyopencl::svm_arg_wrapper&, pybind11::object, pybind11::object, pybind11::object) () from /shared/home/mdiener/Work/pyopencl/pyopencl/_cl.cpython-310-x86_64-linux-gnu.so
#9  0x00007fff922a4824 in pybind11::cpp_function::initialize<pyopencl::event* (*&)(pyopencl::command_queue&, pyopencl::svm_arg_wrapper&, pybind11::object, pybind11::object, pybind11::object), pyopencl::event*, pyopencl::command_queue&, pyopencl::svm_arg_wrapper&, pybind11::object, pybind11::object, pybind11::object, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::arg_v>(pyopencl::event* (*&)(pyopencl::command_queue&, pyopencl::svm_arg_wrapper&, pybind11::object, pybind11::object, pybind11::object), pyopencl::event* (*)(pyopencl::command_queue&, pyopencl::svm_arg_wrapper&, pybind11::object, pybind11::object, pybind11::object), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /shared/home/mdiener/Work/pyopencl/pyopencl/_cl.cpython-310-x86_64-linux-gnu.so
[...]

It crashes in the memfill operation when accessing ptr (p). Note that the arguments are an SVM array (ptr) and a normal allocation (pattern). Again, pocl/pocl@03ffc71 fixes this.

@inducer
Copy link
Owner

inducer commented Jul 10, 2022

I found another fix (workaround?) in pocl/pocl@03ffc71 which just uses CUDA functions for the memfill operation. With that fix, my simple test and the test in this PR also work.

OK, nice. As far as I can tell, using CUDA to do the SVM fill is the correct thing to do. Otherwise it's neither queue-synchronized nor performed from the device.

I don't believe CU_MEM_ATTACH_HOST is the correct flag to use; we should not need the memory to be host-visibile. And upon rereading the phrasing of the description of CU_MEM_ATTACH_GLOBAL, I do not think it means to imply that the memory is host-visible. ("any stream" sort of implies GPU execution, and I suspect they mean "device" in the CUDA sense as well)

@matthiasdiener matthiasdiener force-pushed the svm-args branch 2 times, most recently from f8a6477 to 92122b2 Compare July 13, 2022 19:35
@inducer
Copy link
Owner

inducer commented Jul 13, 2022

@matthiasdiener Please don't force-push to branches on which more than one person is working. Not only is there a risk of clobbering one another's work, it's also very hard to review what's being changed.

loopy/target/pyopencl.py Outdated Show resolved Hide resolved
@matthiasdiener
Copy link
Collaborator Author

matthiasdiener commented Aug 16, 2022

With the current version of mirgecom, and setting limit_arg_size_nbytes to 20, this crashes:

$ python examples/wave.py --lazy
Choose platform:
[0] <pyopencl.Platform 'Portable Computing Language' at 0x7f00190cb008>
Choice [0]:

Choose device(s):
[0] <pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz' on 'Portable Computing Language' at 0x55b419f74ff0>
[1] <pyopencl.Device 'Tesla K40c' on 'Portable Computing Language' at 0x55b419f753f0>
Choice, comma-separated [0]:Set the environment variable PYOPENCL_CTX=':' to avoid being asked again.
Traceback (most recent call last):
  File "/shared/home/mdiener/Work/svmfuse/mirgecom/examples/wave.py", line 190, in <module>
    main(use_profiling=args.profile, use_logmgr=args.logging, lazy=args.lazy)
  File "/shared/home/mdiener/Work/svmfuse/mirgecom/examples/wave.py", line 112, in main
    nodes = actx.thaw(discr.nodes())
  File "/shared/home/mdiener/Work/svmfuse/grudge/grudge/discretization.py", line 729, in nodes
    return self.discr_from_dd(dd).nodes()
  File "/shared/home/mdiener/Work/svmfuse/meshmode/meshmode/discretization/__init__.py", line 679, in nodes
    result = make_obj_array([
  File "/shared/home/mdiener/Work/svmfuse/meshmode/meshmode/discretization/__init__.py", line 680, in <listcomp>
    _DOFArray(None, tuple([
  File "/shared/home/mdiener/Work/svmfuse/meshmode/meshmode/discretization/__init__.py", line 681, in <listcomp>
    actx.freeze(resample_mesh_nodes(grp, iaxis)) for grp in self.groups
  File "/shared/home/mdiener/Work/svmfuse/arraycontext/arraycontext/impl/pytato/__init__.py", line 426, in freeze
    evt, out_dict = pt_prg(self.queue, **bound_arguments)
  File "/shared/home/mdiener/Work/svmfuse/pytato/pytato/target/loopy/__init__.py", line 212, in __call__
    return self.program(queue,
  File "/shared/home/mdiener/Work/svmfuse/loopy/loopy/translation_unit.py", line 347, in __call__
    return pex(*args, **kwargs)
  File "/shared/home/mdiener/Work/svmfuse/loopy/loopy/target/pyopencl_execution.py", line 387, in __call__
    return translation_unit_info.invoker(
  File "/shared/home/mdiener/Work/svmfuse/miniforge3/envs/ceesd/lib/python3.9/site-packages/pytools/py_codegen.py", line 150, in __call__
    return self.func(*args, **kwargs)
  File "<generated code for 'invoke_frozen_nodes0_2d_loopy_kernel'>", line 146, in invoke_frozen_nodes0_2d_loopy_kernel
  File "<generated code for 'invoke_frozen_nodes0_2d_loopy_kernel'>", line 27, in _lpy_host_frozen_nodes0_2d
AttributeError: 'pyopencl._cl.Buffer' object has no attribute 'svm_ptr'

Edit: This was resolved by rebasing Kaushik's branch of arraycontext.

test/test_target.py Outdated Show resolved Hide resolved
@inducer inducer enabled auto-merge (rebase) September 12, 2022 03:50
@inducer inducer removed the request for review from kaushikcfd September 12, 2022 03:50
@inducer inducer merged commit ed6b298 into main Sep 12, 2022
@inducer inducer deleted the svm-args branch September 12, 2022 04:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants