Support for multiple NVIDIA GPUs #901

Open
chris-gputrader opened this issue Jan 17, 2025 · 2 comments

@chris-gputrader

I'm encountering an issue when using the sysbox runtime with containers that require NVIDIA GPU access. The setup works on a single-GPU machine, but fails when deployed on a machine with multiple GPUs.

The container should have access to all GPUs, or to the specific GPUs defined in the Docker Compose file, with the GPU devices and drivers passed through by the sysbox runtime.

When deploying a container on the multi-GPU system, the following error occurs:

Failed to deploy a stack: compose up operation failed: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/e5409caee5c762014641d9a3fa7981fc960b3c2309980dda0e6b5d87b096a649/merged/proc/driver/nvidia: no such file or directory: unknown

@ctalledo
Member

Hi @chris-gputrader, thanks for reporting.

What distro and kernel are you on?

lsb_release -a
uname -a

Also, can you provide the output of sysbox-mgr (journalctl -u sysbox-mgr), particularly the first ~10 lines after it starts? For example:

Jan 24 21:57:17 lenovo systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jan 24 21:57:17 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Starting ..."
Jan 24 21:57:17 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Sysbox data root: /var/lib/sysbox"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs module found in kernel: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs works properly: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Shiftfs-on-overlayfs works properly: no"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="ID-mapped mounts supported by kernel: yes"
Jan 24 21:57:18 lenovo sysbox-mgr[1017186]: time="2025-01-24 21:57:17" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: yes"
...

I want to see if shiftfs and/or ID-mapped mounts are working.
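If it helps, a minimal sketch for pulling out just those checks from the journal. The helper name `sysbox_caps` is hypothetical (it is not part of Sysbox), and it assumes the log format shown above, where each check is in a `msg="..."` field mentioning "Shiftfs" or "ID-mapped".

```shell
# Hypothetical helper (not part of Sysbox): read sysbox-mgr log lines on
# stdin, keep only the shiftfs / ID-mapped mount capability checks, and
# print just the text inside msg="...".
sysbox_caps() {
  grep -E 'Shiftfs|ID-mapped' | sed -E 's/.*msg="([^"]*)".*/\1/'
}
```

Usage: `journalctl -u sysbox-mgr --no-pager | sysbox_caps`. If sysbox-mgr has been restarted several times, the output will contain one set of checks per start.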

Thanks!

@JerFree

JerFree commented Jan 25, 2025

I had the same issue.

root@ecs-19330674:~#  docker run  --gpus all --runtime=sysbox-runc --rm -it --hostname=syscont nestybox/ubuntu-bionic-systemd 
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: Running hook #0:: error running hook: '
nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/6e296a8ed6217cbc604509faa29a9ee7533f6ec5e71fda80632dea66c09f808f/merged/proc/driver/nvidia: no such file or directory: unknown.

root@ecs-19330674:~# journalctl -u sysbox-mgr
Jan 26 01:10:19 ecs-19330674 systemd[1]: Starting sysbox-mgr (part of the Sysbox container runtime)...
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Starting ..."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Sysbox data root: /var/lib/sysbox"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs module found in kernel: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs works properly: no"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Shiftfs-on-overlayfs works properly: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="ID-mapped mounts supported by kernel: yes"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Operating in system container mode."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Relaxed read-only mode disabled."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Inner container image preloading enabled."
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Listening on /run/sysbox/sysmgr.sock"
Jan 26 01:10:20 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:10:20" level=info msg="Ready ..."
Jan 26 01:10:20 ecs-19330674 systemd[1]: Started sysbox-mgr (part of the Sysbox container runtime).
Jan 26 01:12:17 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:12:17" level=info msg="registered new container c00103f71e59"
Jan 26 01:13:53 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:13:53" level=info msg="unregistered container c00103f71e59"
Jan 26 01:13:53 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:13:53" level=info msg="released resources for container c00103f71e59"
Jan 26 01:14:29 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:29" level=info msg="registered new container a9b940e80df7"
Jan 26 01:14:30 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:30" level=info msg="unregistered container a9b940e80df7"
Jan 26 01:14:30 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:14:30" level=info msg="released resources for container a9b940e80df7"
Jan 26 01:15:43 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:15:43" level=info msg="registered new container 33d34a93d64c"
Jan 26 01:27:47 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:27:47" level=info msg="unregistered container 33d34a93d64c"
Jan 26 01:27:48 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 01:27:48" level=info msg="released resources for container 33d34a93d64c"
Jan 26 04:58:26 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 04:58:26" level=info msg="registered new container 029902e180f1"
Jan 26 04:59:40 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 04:59:40" level=info msg="registered new container 01db8efcb24d"
Jan 26 05:00:14 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:14" level=info msg="registered new container 3e1560640e8a"
Jan 26 05:00:15 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:15" level=info msg="unregistered container 3e1560640e8a"
Jan 26 05:00:15 ecs-19330674 sysbox-mgr[117638]: time="2025-01-26 05:00:15" level=info msg="released resources for container 3e1560640e8a"

Thanks.
