When devicePlugin.passDeviceSpecsEnabled is set to true, GPU pod failed to start (failed to apply OCI options) #776

Open
Rei1010 opened this issue on Jan 3, 2025 · 3 comments
Labels: kind/bug (Something isn't working)

Comments

@Rei1010 (Collaborator) commented on Jan 3, 2025

What happened:
Installed HAMi via Helm with default options; the GPU pod fails to run with the error "Error: failed to generate container "180e8893b52ae58994b7db777ee79513fbf6555e10d49cb686f94e8f88666074" spec: failed to apply OCI options: lstat /dev/nvidiactl: no such file or directory"

(with devicePlugin.passDeviceSpecsEnabled: true)
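
For anyone reproducing, the option can presumably be toggled at install/upgrade time with a Helm --set override (the release name, namespace, and chart reference below are assumptions; adjust them to your install):

helm upgrade --install hami hami-charts/hami -n hami-system \
  --set devicePlugin.passDeviceSpecsEnabled=true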

What you expected to happen:
The GPU pod starts and runs normally.

How to reproduce it (as minimally and precisely as possible):
Install HAMi from the latest code and the issue reproduces.

Anything else we need to know?:

Test yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # declare how many physical GPUs the pod needs
          nvidia.com/gpumem: 3000 # each physical GPU allocates 3000 MB of memory to the pod (optional, integer)
          nvidia.com/gpucores: 30 # each physical GPU allocates 30% of its cores to the pod (optional, integer)
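
A minimal way to apply this manifest and inspect the failure (the file name gpu-pod.yaml is assumed):

kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod
kubectl describe pod gpu-pod   # the "failed to apply OCI options: lstat /dev/nvidiactl" error should show up in the pod events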

Enable passDeviceSpecsEnabled:
[screenshot: pod status with passDeviceSpecsEnabled enabled]

Disable passDeviceSpecsEnabled:
[screenshot: pod status with passDeviceSpecsEnabled disabled]

Related PR:
#690
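
For context (based on how pass-device-specs generally behaves in NVIDIA-style device plugins, not verified against #690): when the option is enabled, the plugin returns device specs with host paths such as /dev/nvidiactl in its Allocate response, and the container runtime lstat()s those host paths while generating the OCI spec, so the error above usually means the path is missing or not visible where the runtime resolves it. A quick check on the GPU node itself:

# run on the node, not inside a pod
ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0
# running nvidia-smi on the host usually (re)creates missing device nodes via nvidia-modprobe
nvidia-smi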

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
Rei1010 added the kind/bug label on Jan 3, 2025
@Nimbus318 (Contributor) commented:

Hi @Rei1010,

To help us diagnose the issue, could you please:

Check NVIDIA devices: Run ls -l /dev/nvidia* on your machine to list the NVIDIA device files.

Test PyTorch GPU access: With passDeviceSpecsEnabled: false, create a GPU pod using an image with PyTorch. Inside the pod, run:

$ python
>>> import torch
>>> torch.cuda.is_available()

This will help us understand the problem better. Thanks!

@Rei1010 (Collaborator, Author) commented on Jan 6, 2025

Hi @Nimbus318,

Helm values with passDeviceSpecsEnabled disabled:

helm  get values -n hami-system hami
USER-SUPPLIED VALUES:
devicePlugin:
  passDeviceSpecsEnabled: false

GPU devices:

root@gpu-master:~# lspci | grep -i nvidia
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
jovyan@gpu-pod1:/$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jan  6 01:42 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan  6 01:42 /dev/nvidiactl
crw-rw-rw- 1 root root 234,   0 Jan  6 01:42 /dev/nvidia-uvm
crw-rw-rw- 1 root root 234,   1 Jan  6 01:43 /dev/nvidia-uvm-tools

PyTorch:

jovyan@gpu-pod1:/$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
[HAMI-core Msg(52:139839174941568:libvgpu.c:836)]: Initializing.....
>>> torch.cuda.is_available()
[HAMI-core Msg(52:139839174941568:libvgpu.c:855)]: Initialized
True

nvidia-smi:

[HAMI-core Msg(27:140061859383104:libvgpu.c:836)]: Initializing.....
Mon Jan  6 02:56:55 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P8               6W /  75W |      0MiB /   300MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(27:140061859383104:multiprocess_memory_limit.c:497)]: Calling exit handler 27

Run the CUDA vectorAdd sample:

root@gpu-pod1:/# ./cuda-samples/vectorAdd
[Vector addition of 50000 elements]
[HAMI-core Msg(42:140071454576640:libvgpu.c:836)]: Initializing.....
[HAMI-core Msg(42:140071454576640:libvgpu.c:855)]: Initialized
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[HAMI-core Msg(42:140071454576640:multiprocess_memory_limit.c:497)]: Calling exit handler 42

@Nimbus318 (Contributor) commented:

Thanks for your report! Could you please try setting passDeviceSpecsEnabled to true again and see if the issue can be reproduced?
