
run_dev.sh failing #163

Open
Flipsack opened this issue Dec 12, 2024 · 10 comments
Assignees: hemalshahNV
Labels
  • bug: Something isn't working
  • documentation: Improvements or additions to documentation
  • verify to close: Waiting on confirm issue is resolved

Comments

@Flipsack

I'm trying to use isaac_ros_common

Upon executing ./run_dev.sh, I get the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all, nvidia.com/pva=all: unknown.

Here is the full, verbose output:

$ ./run_dev.sh -v
Launching Isaac ROS Dev container with image key aarch64.ros2_humble: /home/nvidia/workspaces/isaac_ros-dev/
Building aarch64.ros2_humble base as image: isaac_ros_dev-aarch64
Building layered image for key aarch64.ros2_humble as isaac_ros_dev-aarch64
Using configured docker search paths: /home/nvidia/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker
Checking if base image nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a exists on remote registry
Found pre-built base image: nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a: Pulling from nvidia/isaac/ros
Digest: sha256:69b1a8b4373fce2a57ab656cd7c7e2a714f685cfd62168418caeaa216d4315a0
Status: Image is up to date for nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
Finished pulling pre-built base image: nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
Nothing to build, retagged nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a as isaac_ros_dev-aarch64
Running isaac_ros_dev-aarch64-container
+ docker run -it --rm --privileged --network host --ipc=host -v /tmp/.X11-unix:/tmp/.X11-unix -v /home/nvidia/.Xauthority:/home/admin/.Xauthority:rw -e DISPLAY -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all -e ROS_DOMAIN_ID -e USER -e ISAAC_ROS_WS=/workspaces/isaac_ros-dev -e HOST_USER_UID=1000 -e HOST_USER_GID=1000 -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all,nvidia.com/pva=all -v /usr/bin/tegrastats:/usr/bin/tegrastats -v /tmp/:/tmp/ -v /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra -v /usr/src/jetson_multimedia_api:/usr/src/jetson_multimedia_api --pid=host -v /usr/share/vpi3:/usr/share/vpi3 -v /dev/input:/dev/input -v /run/jtop.sock:/run/jtop.sock:ro -v /home/nvidia/workspaces/isaac_ros-dev/:/workspaces/isaac_ros-dev -v /etc/localtime:/etc/localtime:ro --name isaac_ros_dev-aarch64-container --runtime nvidia --entrypoint /usr/local/bin/scripts/workspace-entrypoint.sh --workdir /workspaces/isaac_ros-dev isaac_ros_dev-aarch64 /bin/bash
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all, nvidia.com/pva=all: unknown.
+ cleanup
+ for command in "${ON_EXIT[@]}"
+ popd
~/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts
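
A quick way to check which CDI device names are actually registered on the host is to list them with nvidia-ctk; this is a diagnostic sketch, assuming nvidia-ctk is installed and its version supports the cdi subcommand:

# List the CDI devices the runtime can resolve (read from /etc/cdi and /var/run/cdi).
# If nvidia.com/gpu=all and nvidia.com/pva=all do not appear here, docker cannot inject them.
$ nvidia-ctk cdi list
$ ls /etc/cdi /var/run/cdi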

Setup:

  • Hardware: Jetson Orin Nano Dev Kit
  • Software: Jetpack 6.1

Thanks for any help with this issue.

@hemalshahNV
Contributor

Thanks for raising this. We had seen this on pre-release builds of Jetpack 6.0 and VPI 3.2.1, which required running the pre-requisite steps listed here. These steps should not have been necessary on a Jetpack 6.1 machine, however, and we had not seen it since. Could you run these steps on your Jetson and let us know if that resolves it for you? If so, we'll update the instructions and the troubleshooting docs while we try to reproduce the issue on our end here too.
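
The exact commands are in the linked docs; roughly, and assuming a sufficiently recent NVIDIA Container Toolkit, they boil down to generating a CDI spec for the GPU and enabling CDI support in Docker, along these lines:

# Sketch only; follow the linked documentation for the authoritative steps.
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled=true
$ sudo systemctl restart docker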

@hemalshahNV hemalshahNV self-assigned this Dec 13, 2024
@hemalshahNV hemalshahNV added documentation Improvements or additions to documentation verify to close Waiting on confirm issue is resolved labels Dec 13, 2024
@Flipsack
Author

Thanks for your feedback. I was not successful in running the steps in the provided link. Here is what I have done so far:

Setting up the Jetson from scratch

Since my initial post, I have reinstalled Jetpack. I did this because I was concerned that other software I had installed on the system might somehow be interfering with building the image. I have done the following:

  1. I reinstalled Jetpack 6.1 (rev. 1) using the SDK Manager. I installed Jetpack directly on the SSD on the Jetson.
  2. After installation I installed nvidia-jetpack using apt: sudo apt install nvidia-jetpack
  3. Running jtop provides the following information about my installation
    • Libraries
      • CUDA: 12.6.68
      • cuDNN: 9.3.0.75
      • TensorRT: 10.3.0.30
      • VPI: 3.2.4
      • Vulkan: 1.3.204
      • OpenCV: 4.8.0 with CUDA: NO
    • Hardware
      • Model: NVIDIA Jetson Orin Nano Developer Kit
      • Module: NVIDIA Jetson Orin Nano (Developer Kit)
      • L4T: 36.4.2
      • Jetpack: MISSING
        • I believe the fact that it says "MISSING" is related to the following warning when I execute jtop: [WARN] jetson-stats not supported for [L4T 36.4.2]. (A way to cross-check the L4T version independently of jtop is sketched right after this list.)
        • Running dpkg -l | grep nvidia-jetpack yields
          • ii nvidia-jetpack 6.1+b123 arm64 NVIDIA Jetpack Meta Package
          • ii nvidia-jetpack-dev 6.1+b123 arm64 NVIDIA Jetpack dev Meta Package
          • ii nvidia-jetpack-runtime 6.1+b123 arm64 NVIDIA Jetpack runtime Meta Package
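
A way to cross-check the L4T/Jetpack release independently of jtop (a small sketch, assuming a standard L4T install where these files and packages exist):

# /etc/nv_tegra_release reports the flashed L4T release directly
$ cat /etc/nv_tegra_release
# The Jetpack meta package version installed via apt
$ apt-cache show nvidia-jetpack | grep -E '^(Package|Version):'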

These are the only steps I have done in terms of setting up the system that are not directly related to Isaac.

Isaac setup

I set up Isaac according to this: https://nvidia-isaac-ros.github.io/getting_started/dev_env_setup.html .

I skipped the parts of Step 1 related to moving Docker over to the SSD, since my installation is already on the SSD. However, after my first failure to build isaac_ros_common on the new JetPack installation, I have since added "default-runtime": "nvidia" to the daemon.json file. Also, when testing Docker on the SSD, I was not able to pull nvcr.io/nvidia/l4t-base:r36.4.2. This might be expected; I just wanted to try that tag since it matches my current version of L4T. However, pulling nvcr.io/nvidia/l4t-base:r35.2.1 as described in these docs worked.
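
For reference, after that change my /etc/docker/daemon.json looks roughly like this (a sketch; the runtimes block is what nvidia-ctk runtime configure normally writes, and your file may contain additional keys):

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
$ sudo systemctl restart docker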

I followed steps 2 - 4 and set up the workspace under ~/workspaces/isaac_ros-dev/src, since my installation is on an SSD and not an SD card.

isaac_ros_common

I cloned the isaac_ros_common repo into ~/workspaces/isaac_ros-dev/src/ with the following command: git clone -b release-3.2 https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_common.git isaac_ros_common. Right afterwards, I ran run_dev.sh.

This yielded the following error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all, nvidia.com/pva=all: unknown.

Running pre-requisite steps:

nvidia-smi returns the following:

$ nvidia-smi
Tue Dec 17 14:41:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0                Driver Version: 540.4.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

According to this, it seems the CUDA driver is recognized, considering that CUDA Version says 12.6. I found it worrisome that no stats are available for the GPU; however, according to some googling, this seems to be expected on Jetson devices.

Checking nvidia-container-toolkit yields:

$ nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.14.2
commit: 807c87e057e13fbd559369b8fd722cc7a6f4e5bb

To me, this looks good. However, running nvidia-ctk gives:

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
INFO[0000] Auto-detected mode as "nvml"                 
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to generate CDI edits for GPU devices: error visiting device: failed to get edits for device: failed to create device discoverer: error getting GPU device minor number: Not Supported 

Since this did not work, I checked whether a YAML file already existed for CDI and found the following:

$ cat /etc/cdi/nvidia-pva.yaml 
---
cdiVersion: 0.5.0
containerEdits:
  mounts:
  - containerPath: /run/nvidia-pva-allowd
    hostPath: /run/nvidia-pva-allowd
    options:
    - ro
    - nosuid
    - nodev
    - bind
  hooks:
  - path: /usr/bin/nvidia-pva-hook
    hookName: createContainer
    args:
    - nvidia-pva-hook
    - -d
    - /etc/pva/allow.d
    - create
  - path: /usr/bin/nvidia-pva-allow
    hookName: createContainer
    args:
    - nvidia-pva-allow
    - update
  - path: /usr/bin/nvidia-pva-hook
    hookName: poststop
    args:
    - nvidia-pva-hook
    - -d
    - /etc/pva/allow.d
    - remove
  - path: /usr/bin/nvidia-pva-allow
    hookName: poststop
    args:
    - nvidia-pva-allow
    - update
devices:
- name: "0"
  containerEdits:
    env:
    - NVIDIA_PVA_DEVICE=0
- name: all
  containerEdits:
    env:
    - NVIDIA_PVA_DEVICE=all
kind: nvidia.com/pva

I therefore proceeded to the next step in the hope that the previous YAML file would be sufficient; however, I was met with another error:

$ sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled=true
Incorrect Usage: flag provided but not defined: -cdi.enabled

NAME:
   NVIDIA Container Toolkit CLI runtime configure - Add a runtime to the specified container engine

USAGE:
   NVIDIA Container Toolkit CLI runtime configure [command options] [arguments...]

OPTIONS:
   --dry-run                                          update the runtime configuration as required but don't write changes to disk (default: false)
   --runtime value                                    the target runtime engine; one of [containerd, crio, docker] (default: "docker")
   --config value                                     path to the config file for the target runtime
   --config-mode value                                the config mode for runtimes that support multiple configuration mechanisms
   --oci-hook-path value                              the path to the OCI runtime hook to create if --config-mode=oci-hook is specified. If no path is specified, the generated hook is output to STDOUT.
                                                      Note: The use of OCI hooks is deprecated.
   --nvidia-runtime-name value                        specify the name of the NVIDIA runtime that will be added (default: "nvidia")
   --nvidia-runtime-path value, --runtime-path value  specify the path to the NVIDIA runtime executable (default: "nvidia-container-runtime")
   --nvidia-runtime-hook-path value                   specify the path to the NVIDIA Container Runtime hook executable (default: "/usr/bin/nvidia-container-runtime-hook")
   --nvidia-set-as-default, --set-as-default          set the NVIDIA runtime as the default runtime (default: false)
   --help, -h                                         show help (default: false)
   
ERRO[0000] flag provided but not defined: -cdi.enabled

Conclusion

I was not able to fix the issue using the steps from your previous comment. Is there something else that I'm missing? Thank you for your assistance.

@mickey13

I was able to start the container after commenting out the below line in run_dev.sh (line 243) as a temporary workaround.

DOCKER_ARGS+=("-e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all,nvidia.com/pva=all")
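
For anyone applying the same workaround, a one-liner that comments out that line in place (a sketch; double-check the path and make sure the pattern matches only the intended line):

$ cd ${ISAAC_ROS_WS}/src/isaac_ros_common
$ sed -i 's|^\(\s*DOCKER_ARGS+=("-e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all,nvidia.com/pva=all")\)|# \1|' scripts/run_dev.sh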

@figkim

figkim commented Dec 18, 2024

+1, I'm following exactly the same steps and experiencing exactly the same issue as @Flipsack.

Setup:

  • Hardware: Jetson Orin AGX Dev Kit 64GB
  • Software: Jetpack 6.1

@Flipsack
Author

I was able to build the image after commenting out the line that @mickey13 suggested.

@The-Real-Thisas

Able to replicate.

# R36 (release), REVISION: 4.0, GCID: 37537400, BOARD: generic, EABI: aarch64, DATE: Fri Sep 13 04:36:44 UTC 2024
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
nvidia@nvidia-desktop:/mnt/nova_ssd/workspaces/isaac_ros-dev/src/isaac_ros_common$ cd ${ISAAC_ROS_WS}/src/isaac_ros_common && ./scripts/run_dev.sh
Launching Isaac ROS Dev container with image key aarch64.ros2_humble: /mnt/nova_ssd/workspaces/isaac_ros-dev/
Building aarch64.ros2_humble base as image: isaac_ros_dev-aarch64
Building layered image for key aarch64.ros2_humble as isaac_ros_dev-aarch64
Using configured docker search paths: /mnt/nova_ssd/workspaces/isaac_ros-dev/src/isaac_ros_common/scripts/../docker
Checking if base image nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a exists on remote registry
Found pre-built base image: nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a: Pulling from nvidia/isaac/ros
Digest: sha256:69b1a8b4373fce2a57ab656cd7c7e2a714f685cfd62168418caeaa216d4315a0
Status: Image is up to date for nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
Finished pulling pre-built base image: nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a
Nothing to build, retagged nvcr.io/nvidia/isaac/ros:aarch64-ros2_humble_deaea1a392d5c02f76be3f4651f4b65a as isaac_ros_dev-aarch64
Running isaac_ros_dev-aarch64-container
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all, nvidia.com/pva=all: unknown.
/mnt/nova_ssd/workspaces/isaac_ros-dev/src/isaac_ros_common

Able to fix by removing the CDI device request, following @mickey13's workaround.

@beniaminopozzan

@Flipsack I followed your steps as in #163 (comment) and encountered the same errors.

To me, this looks good. However, running nvidia-ctk gives:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
INFO[0000] Auto-detected mode as "nvml"
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to generate CDI edits for GPU devices: error visiting device: failed to get edits for device: failed to create device discoverer: error getting GPU device minor number: Not Supported

This issue was reported and solved here: https://forums.developer.nvidia.com/t/podman-gpu-on-jetson-agx-orin/297734/10?u=development7. The fix is to force csv format:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --mode=csv
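
A quick sanity check after generating (assuming the spec was written to /etc/cdi/nvidia.yaml) is to confirm that the nvidia.com/gpu kind and an "all" device are listed:

$ grep -E '^kind:|^- name:' /etc/cdi/nvidia.yaml
# Expect kind: nvidia.com/gpu plus device entries, including one named "all"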

I still got the same error:

$ sudo nvidia-ctk runtime configure --runtime=docker --cdi.enabled=true
Incorrect Usage: flag provided but not defined: -cdi.enabled

on the next step, but this time ./run_dev.sh worked.

@hemalshahNV hemalshahNV added the bug Something isn't working label Jan 2, 2025
@hemalshahNV
Contributor

This may be because of an older version of NVIDIA Container Toolkit (see here for how to update to at least 1.16). It is possible that the JetPack upgrade from 6.0 to 6.1 did not update the NVIDIA Container Toolkit for you. Updating it should resolve this without any workarounds and keep PVA accessible within the dev container as intended.
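
For reference, the upgrade path would look roughly like this (a sketch based on NVIDIA's published apt instructions for the Container Toolkit; the linked docs are authoritative):

# Add NVIDIA's apt repository for the container toolkit (skip if already configured)
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install/upgrade, then confirm the reported version is at least 1.16
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ nvidia-ctk --version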

Alternatively, I had run into the same issues listed here on pre-release JP6.0 and was able to get things mostly working on NVIDIA Container Toolkit 1.14 (I could not upgrade to 1.16 because the update list was too extensive) using the --mode=csv workaround AND adding the following to my /etc/docker/daemon.json (then restarting the docker daemon):

{ "features": { "cdi": true }, "cdi-spec-dirs": ["/etc/cdi/", "/var/run/cdi"] }

@ripdk12

ripdk12 commented Jan 3, 2025

Same issue here on Jetson Orin Nano Devkit.

@hemalshahNV I was unable to install anything newer than NVIDIA Container Toolkit 1.14.2 on my Jetson following your link, even after configuring experimental packages.

@beniaminopozzan After following your fix, ./run_dev.sh worked for me as well.

@beniaminopozzan

I'm also unable to update the NVIDIA Container Toolkit above 1.14.2 on a Jetson Orin NX flashed with JP6.1 (rev. 1).
