When devicePlugin.passDeviceSpecsEnabled is set to true, GPU pod failed to start (failed to apply OCI options) #776

Open
Rei1010 opened this issue on Jan 3, 2025 · 3 comments
Labels: kind/bug (Something isn't working)

Comments

@Rei1010 (Collaborator) commented on Jan 3, 2025

What happened:
Installed HAMi via Helm with default options; the GPU pod fails to run with the error "Error: failed to generate container "180e8893b52ae58994b7db777ee79513fbf6555e10d49cb686f94e8f88666074" spec: failed to apply OCI options: lstat /dev/nvidiactl: no such file or directory"

(with devicePlugin.passDeviceSpecsEnabled: true)
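
For anyone reproducing, the option can presumably be toggled at install/upgrade time with a Helm --set override (the release name, namespace, and chart reference below are assumptions; adjust them to your install):

helm upgrade --install hami hami-charts/hami -n hami-system \
  --set devicePlugin.passDeviceSpecsEnabled=true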

What you expected to happen:
The GPU pod starts and runs normally.

How to reproduce it (as minimally and precisely as possible):
Install HAMi from the latest code and the issue reproduces.

Anything else we need to know?:

Test yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # declare how many physical GPUs the pod needs
          nvidia.com/gpumem: 3000 # each physical GPU allocates 3000 MB of memory to the pod (optional, integer)
          nvidia.com/gpucores: 30 # each physical GPU allocates 30% of its cores to the pod (optional, integer)
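
A minimal way to apply this manifest and inspect the failure (the file name gpu-pod.yaml is assumed):

kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod
kubectl describe pod gpu-pod   # the "failed to apply OCI options: lstat /dev/nvidiactl" error should show up in the pod events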

Enable passDeviceSpecsEnabled:
[screenshot: pod status with passDeviceSpecsEnabled enabled]

Disable passDeviceSpecsEnabled:
[screenshot: pod status with passDeviceSpecsEnabled disabled]

Related PR:
#690
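
For context (based on how pass-device-specs generally behaves in NVIDIA-style device plugins, not verified against #690): when the option is enabled, the plugin returns device specs with host paths such as /dev/nvidiactl in its Allocate response, and the container runtime lstat()s those host paths while generating the OCI spec, so the error above usually means the path is missing or not visible where the runtime resolves it. A quick check on the GPU node itself:

# run on the node, not inside a pod
ls -l /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0
# running nvidia-smi on the host usually (re)creates missing device nodes via nvidia-modprobe
nvidia-smi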

Environment:

  • HAMi version:
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
Rei1010 added the kind/bug label on Jan 3, 2025
@Nimbus318 (Contributor) commented:

Hi @Rei1010,

To help us diagnose the issue, could you please:

Check NVIDIA devices: Run ls -l /dev/nvidia* on your machine to list the NVIDIA device files.

Test PyTorch GPU access: With passDeviceSpecsEnabled: false, create a GPU pod using an image with PyTorch. Inside the pod, run:

$ python
>>> import torch
>>> torch.cuda.is_available()

This will help us understand the problem better. Thanks!

@Rei1010 (Collaborator, Author) commented on Jan 6, 2025

Hi @Nimbus318,

Helm values with passDeviceSpecsEnabled disabled:

helm  get values -n hami-system hami
USER-SUPPLIED VALUES:
devicePlugin:
  passDeviceSpecsEnabled: false

GPU devices:

root@gpu-master:~# lspci | grep -i nvidia
0b:00.0 3D controller: NVIDIA Corporation GP104GL [Tesla P4] (rev a1)
jovyan@gpu-pod1:/$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jan  6 01:42 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan  6 01:42 /dev/nvidiactl
crw-rw-rw- 1 root root 234,   0 Jan  6 01:42 /dev/nvidia-uvm
crw-rw-rw- 1 root root 234,   1 Jan  6 01:43 /dev/nvidia-uvm-tools

PyTorch:

jovyan@gpu-pod1:/$ python
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
[HAMI-core Msg(52:139839174941568:libvgpu.c:836)]: Initializing.....
>>> torch.cuda.is_available()
[HAMI-core Msg(52:139839174941568:libvgpu.c:855)]: Initialized
True

nvidia-smi:

[HAMI-core Msg(27:140061859383104:libvgpu.c:836)]: Initializing.....
Mon Jan  6 02:56:55 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.5     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P8               6W /  75W |      0MiB /   300MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[HAMI-core Msg(27:140061859383104:multiprocess_memory_limit.c:497)]: Calling exit handler 27

Run the CUDA vectorAdd sample:

root@gpu-pod1:/# ./cuda-samples/vectorAdd
[Vector addition of 50000 elements]
[HAMI-core Msg(42:140071454576640:libvgpu.c:836)]: Initializing.....
[HAMI-core Msg(42:140071454576640:libvgpu.c:855)]: Initialized
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
[HAMI-core Msg(42:140071454576640:multiprocess_memory_limit.c:497)]: Calling exit handler 42

@Nimbus318 (Contributor) commented:

Thanks for your report! Could you please try setting passDeviceSpecsEnabled to true again and see if the issue can be reproduced?
