diff --git a/managing-odh.adoc b/managing-odh.adoc index 95fc2b63..6a0a62e3 100644 --- a/managing-odh.adoc +++ b/managing-odh.adoc @@ -44,6 +44,7 @@ include::assemblies/customizing-component-deployment-resources.adoc[leveloffset= //add intro text? include::modules/enabling-nvidia-gpus.adoc[leveloffset=+2] include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+2] +include::modules/amd-gpu-integration.adoc[leveloffset=+2] //include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+3] == Managing distributed workloads diff --git a/modules/about-base-training-images.adoc b/modules/about-base-training-images.adoc index 65c5b07a..a28b8ae9 100644 --- a/modules/about-base-training-images.adoc +++ b/modules/about-base-training-images.adoc @@ -54,4 +54,4 @@ endif::[] ifndef::upstream[] * Create a custom image that includes the additional libraries or packages. For more information, see link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#creating-a-custom-training-image_distributed-workloads[Creating a custom training image]. -endif::[] +endif::[] \ No newline at end of file diff --git a/modules/about-workbench-images.adoc b/modules/about-workbench-images.adoc index 7c2cb687..e541f80d 100644 --- a/modules/about-workbench-images.adoc +++ b/modules/about-workbench-images.adoc @@ -100,4 +100,13 @@ To use the *CUDA - RStudio Server* workbench image, you must first build it by c The *CUDA - RStudio Server* workbench image contains NVIDIA CUDA technology. CUDA licensing information is available at link:https://docs.nvidia.com/cuda/[https://docs.nvidia.com/cuda/]. Review the licensing terms before you use this sample workbench. ==== endif::[] -|=== + +| ROCm +| Use the ROCm notebook image to run AI and machine learning workloads on AMD GPUs in {productname-short}. It includes ROCm libraries and tools optimized for high-performance GPU acceleration, supporting custom AI workflows and data processing tasks. Use this image integrating additional frameworks or dependencies tailored to your specific AI development needs. + +| ROCm-PyTorch +| Use the ROCm-PyTorch notebook image to optimize PyTorch workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated PyTorch libraries, enabling efficient deep learning training, inference, and experimentation. This image is designed for data scientists working with PyTorch-based workflows, offering integration with GPU scheduling. + +| ROCm-TensorFlow +| Use the ROCm-TensorFlow notebook image to optimize TensorFlow workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated TensorFlow libraries to support high-performance deep learning model training and inference. This image simplifies TensorFlow development on AMD GPUs and integrates with {productname-short} for resource scaling and management. +|=== \ No newline at end of file diff --git a/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc b/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc index 26f5c9c5..acbba184 100644 --- a/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc +++ b/modules/adding-a-model-server-for-the-multi-model-serving-platform.adoc @@ -9,13 +9,13 @@ When you have enabled the multi-model serving platform, you must configure a mod ifdef::self-managed[] [NOTE] ==== -In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving. 
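The note changed here extends the supported model-serving accelerators to AMD GPUs, and the ROCm workbench image descriptions added earlier in this patch target the same hardware. As background (not part of the patch itself): once the AMD GPU Operator's device plugin is running, a workload requests an AMD GPU as the `amd.com/gpu` extended resource, in the same way it would request `nvidia.com/gpu`. A minimal sketch follows; the pod name and image reference are placeholders, and `rocm-smi` is only available if the image ships the ROCm tools.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test                # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: rocm
    image: <rocm-enabled-image>        # placeholder; use an image that includes the ROCm tools
    command: ["rocm-smi"]              # prints GPU status when the device is visible to the container
    resources:
      limits:
        amd.com/gpu: 1                 # extended resource advertised by the AMD GPU Operator device plugin
----

If the pod completes and logs GPU details, scheduling and the device plugin are working; the ROCm, ROCm-PyTorch, and ROCm-TensorFlow workbench images consume the same `amd.com/gpu` resource through their container resource requests.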
+In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving. ==== endif::[] ifdef::cloud-service[] [NOTE] ==== -In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for model serving. +In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving. ==== endif::[] @@ -31,11 +31,11 @@ endif::[] * You have enabled the multi-model serving platform. ifndef::upstream[] * If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{rhoaidocshome}{default-format-url}/serving_models/serving-small-and-medium-sized-models_model-serving#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime]. -* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. +* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^]. endif::[] ifdef::upstream[] * If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime]. -* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation. +* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU and AMD GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation. endif::[] .Procedure @@ -81,4 +81,4 @@ If you are using a _custom_ model-serving runtime with your model server and wan . Optional: To update the model server, click the action menu (*⋮*) beside the model server and select *Edit model server*. //[role="_additional-resources"] -//.Additional resources +//.Additional resources \ No newline at end of file diff --git a/modules/amd-gpu-integration.adoc b/modules/amd-gpu-integration.adoc new file mode 100644 index 00000000..7e4202a4 --- /dev/null +++ b/modules/amd-gpu-integration.adoc @@ -0,0 +1,14 @@ +:_module-type: CONCEPT + +[id='amd-gpu-integration_{context}'] += AMD GPU Integration + +You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. 
AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently. + +Integrating AMD GPUs with {productname-short} involves the following components: + +* **ROCm workbench images**: + Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation. + +* **AMD GPU Operator**: + The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads. diff --git a/modules/configuring-quota-management-for-distributed-workloads.adoc b/modules/configuring-quota-management-for-distributed-workloads.adoc index cc467c95..bfcc0ef5 100644 --- a/modules/configuring-quota-management-for-distributed-workloads.adoc +++ b/modules/configuring-quota-management-for-distributed-workloads.adoc @@ -70,28 +70,27 @@ endif::[] ifndef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}. -See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. +If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^]. + ifdef::self-managed[] [NOTE] ==== -In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads. +In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads. ==== endif::[] ifdef::cloud-service[] [NOTE] ==== -In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads. +In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads. ==== endif::[] endif::[] ifdef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support. -This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. +This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU and AMD GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation. endif::[] - .Procedure . In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example: @@ -132,7 +131,7 @@ metadata: spec: namespaceSelector: {} # match all. 
resourceGroups: - - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] + - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu" flavors: - name: "default-flavor" resources: @@ -140,7 +139,7 @@ spec: nominalQuota: 9 - name: "memory" nominalQuota: 36Gi - - name: "nvidia.com/gpu" + - name: "nvidia.com/gpu" # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu" nominalQuota: 5 ---- + @@ -200,4 +199,4 @@ $ oc get -n ____ localqueues [role='_additional-resources'] .Additional resources -* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation] +* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation] \ No newline at end of file diff --git a/modules/creating-a-custom-training-image.adoc b/modules/creating-a-custom-training-image.adoc index fac282a3..54a0ffb7 100644 --- a/modules/creating-a-custom-training-image.adoc +++ b/modules/creating-a-custom-training-image.adoc @@ -63,7 +63,7 @@ FROM quay.io/modh/ray:2.35.0-py39-rocm61 FROM quay.io/modh/ray:2.35.0-py311-rocm61 ---- -* To create a CUDA-compatible KFTO cluster image, specify the CUDA-compatible KFTO base image location: +* To create a CUDA-compatible KFTO cluster image, specify the Developer Preview CUDA-compatible KFTO base image location: + .CUDA-compatible KFTO base image with Python 3.11 [source,bash] @@ -126,5 +126,4 @@ If your new image was created successfully, it is included in the list of images podman push ${IMG}:0.0.1 ---- -. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry]. - +. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry]. \ No newline at end of file diff --git a/modules/deploying-models-on-single-node-openshift-using-kserve-raw-deployment-mode.adoc b/modules/deploying-models-on-single-node-openshift-using-kserve-raw-deployment-mode.adoc index 22828fb4..1a766a23 100644 --- a/modules/deploying-models-on-single-node-openshift-using-kserve-raw-deployment-mode.adoc +++ b/modules/deploying-models-on-single-node-openshift-using-kserve-raw-deployment-mode.adoc @@ -23,7 +23,7 @@ Deploying a machine learning model using KServe raw deployment mode on single no * For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket. * To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository. ifndef::upstream[] -* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. 
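In the quota-management example above, the inline comments say to substitute `nvidia.com/gpu` with `amd.com/gpu` if you use AMD GPUs. Spelled out as a sketch (reusing the same illustrative nominal quotas as the NVIDIA example), the `resourceGroups` section of the ClusterQueue would look like this:

[source,yaml]
----
resourceGroups:
- coveredResources: ["cpu", "memory", "amd.com/gpu"]
  flavors:
  - name: "default-flavor"
    resources:
    - name: "cpu"
      nominalQuota: 9
    - name: "memory"
      nominalQuota: 36Gi
    - name: "amd.com/gpu"     # AMD GPU extended resource in place of nvidia.com/gpu
      nominalQuota: 5
----

The `default-flavor` ResourceFlavor and the rest of the ClusterQueue definition are unchanged.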
+* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^]. * To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. endif::[] ifdef::upstream[] @@ -274,4 +274,4 @@ Ensure that you replace ``, ``, ``, ` *Installed Operators* page. Confirm that the following Operators appear: + +* AMD GPU Operator +* Node Feature Discovery (NFD) +* Kernel Module Management (KMM) + +[NOTE] +==== +Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration may prevent the AMD GPUs from being recognized or functioning properly. +==== diff --git a/modules/installing-the-distributed-workloads-components.adoc b/modules/installing-the-distributed-workloads-components.adoc index 6a822506..aa8f96c7 100644 --- a/modules/installing-the-distributed-workloads-components.adoc +++ b/modules/installing-the-distributed-workloads-components.adoc @@ -35,12 +35,11 @@ endif::[] ifndef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}. -See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. +If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^]. + ifdef::self-managed[] [NOTE] ==== -In {productname-short} {vernum}, for distributed workloads, {org-name} supports only NVIDIA GPU accelerators. {org-name} supports the use of accelerators within the same cluster only. {org-name} does not support remote direct memory access (RDMA) between accelerators, or the use of accelerators across a network, for example, by using technology such as NVIDIA GPUDirect or NVLink. ==== @@ -48,7 +47,6 @@ endif::[] ifdef::cloud-service[] [NOTE] ==== -In {productname-short}, for distributed workloads, {org-name} supports only NVIDIA GPU accelerators. {org-name} supports the use of accelerators within the same cluster only. {org-name} does not support remote direct memory access (RDMA) between accelerators, or the use of accelerators across a network, for example, by using technology such as NVIDIA GPUDirect or NVLink. ==== @@ -56,8 +54,8 @@ endif::[] endif::[] ifdef::upstream[] * If you want to use graphics processing units (GPUs), you have enabled GPU support. 
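The enabling procedure above asks you to confirm that the AMD GPU, Node Feature Discovery (NFD), and Kernel Module Management (KMM) Operators appear on the *Installed Operators* page. As a CLI cross-check (not part of this patch), the following commands are one way to verify the same state; exact CSV names vary by Operator version and catalog source, and the node label shown assumes the default NFD configuration, which labels nodes by PCI vendor ID (1002 for AMD).

[source,terminal]
----
# List installed Operator CSVs and look for the AMD GPU, NFD, and KMM entries
oc get csv -A | grep -Ei 'amd|node-feature|kernel-module'

# List nodes that NFD has labeled as having an AMD PCI device (vendor ID 1002)
oc get nodes -l feature.node.kubernetes.io/pci-1002.present=true
----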
-This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator. -For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation. +This process includes installing the Node Feature Discovery Operator and the relevant GPU Operator. +For more information, see link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] for NVIDIA GPUs and link:https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html[AMD GPU Operator on {org-name} OpenShift Container Platform^] for AMD GPUs. endif::[] ifdef::cloud-service[] @@ -220,4 +218,3 @@ endif::[] ifndef::upstream[] Configure the distributed workloads feature as described in link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/managing_distributed_workloads[Managing distributed workloads]. endif::[] - diff --git a/modules/next-steps-getting-started.adoc b/modules/next-steps-getting-started.adoc index c24bb992..e0dfeab5 100644 --- a/modules/next-steps-getting-started.adoc +++ b/modules/next-steps-getting-started.adoc @@ -72,7 +72,7 @@ ifdef::upstream[] link:{odhdocshome}/working-with-accelerators/[Working with accelerators] endif::[] + -If you work with large data sets, you can use accelerators, such as NVIDIA GPUs and Intel Gaudi AI accelerators, to optimize the performance of your data science models in {productname-short}. With accelerators, you can scale your work, reduce latency, and increase productivity. +If you work with large data sets, you can use accelerators, such as NVIDIA GPUs, AMD GPUs, and Intel Gaudi AI accelerators, to optimize the performance of your data science models in {productname-short}. With accelerators, you can scale your work, reduce latency, and increase productivity. Implement distributed workloads for higher performance:: ifndef::upstream[] diff --git a/modules/overview-of-accelerators.adoc b/modules/overview-of-accelerators.adoc index 8cc03233..a5b701ae 100644 --- a/modules/overview-of-accelerators.adoc +++ b/modules/overview-of-accelerators.adoc @@ -11,14 +11,21 @@ If you work with large data sets, you can use accelerators to optimize the perfo * Training deep neural networks * Data cleansing and data processing -{productname-short} supports the following accelerators: +{productname-short} supports the following accelerators: * NVIDIA graphics processing units (GPUs) -** To use compute-heavy workloads in your models, you can enable NVIDIA graphics processing units (GPUs) in {productname-short}. -** To enable GPUs on OpenShift, you must install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator]. +** To use compute-heavy workloads in your models, you can enable NVIDIA graphics processing units (GPUs) in {productname-short}. +** To enable NVIDIA GPUs on OpenShift, you must install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator]. +* AMD graphics processing units (GPUs) +** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference. +** To enable AMD GPUs on OpenShift, you must do the following tasks: +*** Install the AMD GPU Operator. 
+*** Follow the instructions for full deployment and driver configuration in the link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/index.html[AMD GPU Operator documentation]. + +** Once installed, the AMD GPU Operator allows you to use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. * Intel Gaudi AI accelerators ** Intel provides hardware accelerators intended for deep learning workloads. -** Before you can enable Intel Gaudi AI accelerators in {productname-short}, you must install the necessary dependencies. Also, the version of the Intel Gaudi AI Operator that you install must match the version of the corresponding workbench image in your deployment. +** Before you can enable Intel Gaudi AI accelerators in {productname-short}, you must install the necessary dependencies. Also, the version of the Intel Gaudi AI Operator that you install must match the version of the corresponding workbench image in your deployment. ** A workbench image for Intel Gaudi accelerators is not included in {productname-short} by default. Instead, you must create and configure a custom notebook to enable Intel Gaudi AI support. ** You can enable Intel Gaudi AI accelerators on-premises or with AWS DL1 compute nodes on an AWS instance. @@ -27,5 +34,7 @@ Before you can use an accelerator in {productname-short}, your OpenShift instanc [role="_additional-resources"] .Additional resources * link:https://habana.ai/[Habana, an Intel Company] -* link:https://aws.amazon.com/ec2/instance-types/dl1/[Amazon EC2 DL1 Instances] -* link:https://linux.die.net/man/8/lspci[lspci(8) - Linux man page] +* link:https://aws.amazon.com/ec2/instance-types/dl1/[Amazon EC2 DL1 Instances] +* link:https://docs.amd.com/en/solutions/ai-machine-learning/rocm[AMD ROCm Platform Documentation] +* link:https://github.com/ROCm/gpu-operator[AMD GPU Operator on GitHub] +* link:https://linux.die.net/man/8/lspci[lspci(8) - Linux man page] \ No newline at end of file diff --git a/modules/ref-example-kueue-resource-configurations.adoc b/modules/ref-example-kueue-resource-configurations.adoc index e8d422ce..1fe00bf6 100644 --- a/modules/ref-example-kueue-resource-configurations.adoc +++ b/modules/ref-example-kueue-resource-configurations.adoc @@ -1,4 +1,3 @@ - :_module-type: REFERENCE [id='example-kueue-resource-configurations_{context}'] @@ -23,9 +22,7 @@ endif::[] endif::[] -//== NVIDIA GPUs without shared cohort -// When AMD GPUs are supported, uncomment the above line and delete the following line -== NVIDIA GPUs +== NVIDIA GPUs without shared cohort === NVIDIA RTX A400 GPU resource flavor @@ -106,9 +103,6 @@ spec: nominalQuota: 2 ---- -// When AMD GPUs are supported, uncomment the following section - -//// == NVIDIA GPUs and AMD GPUs without shared cohort === AMD GPU resource flavor @@ -195,9 +189,7 @@ spec: nominalQuota: 2 ---- -//// - [role='_additional-resources'] == Additional resources * link:https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/[Resource Flavor] in the Kueue documentation -* link:https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/[Cluster Queue] in the Kueue documentation +* link:https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/[Cluster Queue] in the Kueue documentation \ No newline at end of file diff --git a/modules/running-distributed-data-science-workloads-from-ds-pipelines.adoc b/modules/running-distributed-data-science-workloads-from-ds-pipelines.adoc index d8f631ee..57a87d95 100644 --- a/modules/running-distributed-data-science-workloads-from-ds-pipelines.adoc +++ 
b/modules/running-distributed-data-science-workloads-from-ds-pipelines.adoc @@ -62,15 +62,7 @@ $ pip install kfp .. Install any other dependencies that are required for your pipeline. .. Build your data science pipeline in Python code. + -For example, create a file named `compile_example.py` with the following content. -ifdef::upstream[] -+ -[NOTE] --- -If you copy and paste the following code example, remember to remove the _callouts_, which are not part of the code. -The callouts (parenthetical numbers, highlighted in bold font in this document) map the relevant line of code to an explanatory note in the text immediately after the code example. --- -endif::[] +For example, if you use NVIDIA GPUs, create a file named `compile_example.py` with the following content: + [source,Python] ---- @@ -87,7 +79,7 @@ def ray_fn(): import ray <1> from codeflare_sdk import Cluster, ClusterConfiguration, generate_cert <2> - + # If you do not use NVIDIA GPUs, substitute “nvidia.com/gpu” with the correct value for your accelerator cluster = Cluster( <3> ClusterConfiguration( namespace="my_project", <4> @@ -225,4 +217,4 @@ ifdef::upstream[] * link:{odhdocshome}/working-with-data-science-pipelines/[Working with data science pipelines] endif::[] -* link:https://docs.ray.io/en/latest/cluster/getting-started.html[Ray Clusters documentation] +* link:https://docs.ray.io/en/latest/cluster/getting-started.html[Ray Clusters documentation] \ No newline at end of file diff --git a/modules/running-the-demo-notebooks-from-the-codeflare-sdk.adoc b/modules/running-the-demo-notebooks-from-the-codeflare-sdk.adoc index ceb6fbc5..facca8da 100644 --- a/modules/running-the-demo-notebooks-from-the-codeflare-sdk.adoc +++ b/modules/running-the-demo-notebooks-from-the-codeflare-sdk.adoc @@ -134,8 +134,7 @@ If you omit this line, one of the following Ray cluster images is used by defaul The default Ray images are compatible with NVIDIA GPUs that are supported by CUDA 12.1. The default images are AMD64 images, which might not work on other architectures. -Additional ROCm-compatible Ray cluster images are available. -These images are compatible with AMD accelerators that are supported by ROCm 6.1. +Additional ROCm-compatible Ray cluster images are compatible with AMD accelerators that are supported by ROCm 6.1. These images are AMD64 images, which might not work on other architectures. ifndef::upstream[] @@ -203,4 +202,4 @@ endif::[] * link:https://url[link text] -//// +//// \ No newline at end of file diff --git a/modules/starting-a-jupyter-notebook-server.adoc b/modules/starting-a-jupyter-notebook-server.adoc index de6211a6..f44b4bce 100644 --- a/modules/starting-a-jupyter-notebook-server.adoc +++ b/modules/starting-a-jupyter-notebook-server.adoc @@ -51,10 +51,10 @@ When a new version of a notebook image is released, the previous version remains [IMPORTANT] -- ifdef::upstream[] -Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. +Using accelerators is only supported with specific notebook images. For GPUs, only the AMD ROCm, PyTorch, TensorFlow, and CUDA notebook images are supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. 
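In the pipeline module changed above, a comment notes that users of other accelerators should substitute `nvidia.com/gpu` with the correct resource name. For AMD GPUs, a Ray cluster configuration along the following lines is one possibility. This is a sketch only: the extended-resource parameter names are an assumption about the codeflare-sdk release in use and should be checked against the SDK version pinned in your pipeline, while the Ray image is the ROCm-compatible base image referenced elsewhere in this patch.

[source,Python]
----
from codeflare_sdk import Cluster, ClusterConfiguration

# Sketch only: parameter names assume a codeflare-sdk release that supports
# extended resource requests; verify them against your installed SDK version.
cluster = Cluster(ClusterConfiguration(
    namespace="my_project",                               # project that hosts the Ray cluster
    name="raytest-rocm",                                   # placeholder cluster name
    num_workers=1,
    head_extended_resource_requests={"amd.com/gpu": 1},    # request AMD GPUs instead of nvidia.com/gpu
    worker_extended_resource_requests={"amd.com/gpu": 1},
    image="quay.io/modh/ray:2.35.0-py311-rocm61",           # ROCm-compatible Ray image (Python 3.11, ROCm 6.1)
))
----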
endif::[] ifndef::upstream[] -Using accelerators is only supported with specific notebook images. For GPUs, only the PyTorch, TensorFlow, and CUDA notebook images are supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable accelerator support, see link:{rhoaidocshome}{default-format-url}/working_with_accelerators/overview-of-accelerators_accelerators[Working with accelerators]. +Using accelerators is only supported with specific notebook images. For GPUs, only the AMD ROCm, PyTorch, TensorFlow, and CUDA notebook images are supported. In addition, you can only specify the number of accelerators required for your notebook server if accelerators are enabled on your cluster. To learn how to enable accelerator support, see link:{rhoaidocshome}{default-format-url}/working_with_accelerators/overview-of-accelerators_accelerators[Working with accelerators]. endif::[] -- .. Optional: Select and specify values for any new *Environment variables*. @@ -81,4 +81,4 @@ After the server starts, you see one of the following behaviors: * The JupyterLab interface opens. .Troubleshooting -* If you see the "Unable to load notebook server configuration options" error message, contact your administrator so that they can review the logs associated with your Jupyter pod and determine further details about the problem. +* If you see the "Unable to load notebook server configuration options" error message, contact your administrator so that they can review the logs associated with your Jupyter pod and determine further details about the problem. \ No newline at end of file diff --git a/modules/verifying-amd-gpu-availability-on-your-cluster.adoc b/modules/verifying-amd-gpu-availability-on-your-cluster.adoc new file mode 100644 index 00000000..3c56da60 --- /dev/null +++ b/modules/verifying-amd-gpu-availability-on-your-cluster.adoc @@ -0,0 +1,66 @@ +:_module-type: PROCEDURE + +[id="verifying-amd-gpu-availability-on-your-cluster_{context}"] += Verifying AMD GPU availability on your cluster + +[role='_abstract'] +Before you proceed with the AMD GPU Operator installation process, you can verify the presence of an AMD GPU device on a node within your {openshift-platform} cluster. You can use commands such as `lspci` or `oc` to confirm hardware and resource availability. + +.Prerequisites +* You have administrative access to the {openshift-platform} cluster. +* You have a running {openshift-platform} cluster with a node equipped with an AMD GPU. +* You have access to the OpenShift CLI (`oc`) and terminal access to the node. + +.Procedure +. Use the OpenShift CLI to verify if GPU resources are allocatable: +.. List all nodes in the cluster to identify the node with an AMD GPU: ++ +---- +oc get nodes +---- +.. Note the name of the node where you expect the AMD GPU to be present. +.. Describe the node to check its resource allocation: ++ +---- +oc describe node +---- +.. In the output, locate the **Capacity** and **Allocatable** sections and confirm that `amd.com/gpu` is listed. For example: ++ +---- +Capacity: + amd.com/gpu: 1 +Allocatable: + amd.com/gpu: 1 +---- +. Check for the AMD GPU device using the `lspci` command: +.. Log in to the node: ++ +---- +oc debug node/ +chroot /host +---- +.. Run the `lspci` command and search for the supported AMD device in your deployment. For example: ++ +---- +lspci | grep -E "MI210|MI250|MI300" +---- +.. Verify that the output includes one of the AMD GPU models. 
For example: ++ +---- +03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD] Instinct MI210 +---- +. Optional: Use the `rocminfo` command if the ROCm stack is installed on the node: ++ +---- +rocminfo +---- +.. Confirm that the ROCm tool outputs details about the AMD GPU, such as compute units, memory, and driver status. + +.Verification +* The `oc describe node ` command lists `amd.com/gpu` under **Capacity** and **Allocatable**. +* The `lspci` command output identifies an AMD GPU as a PCI device matching one of the specified models (for example, MI210, MI250, MI300). +* Optional: The `rocminfo` tool provides detailed GPU information, confirming driver and hardware configuration. + +[role="_additional-resources"] +.Additional resources +* link:https://github.com/ROCm/gpu-operator[AMD GPU Operator GitHub Repository] diff --git a/working-with-accelerators.adoc b/working-with-accelerators.adoc index 9d90f6cf..1a32deaf 100644 --- a/working-with-accelerators.adoc +++ b/working-with-accelerators.adoc @@ -16,7 +16,7 @@ include::_artifacts/document-attributes-global.adoc[] = Working with accelerators -Use accelerators, such as NVIDIA GPUs and Intel Gaudi AI accelerators, to optimize the performance of your end-to-end data science workflows. +Use accelerators, such as NVIDIA GPUs, AMD GPUs, and Intel Gaudi AI accelerators, to optimize the performance of your end-to-end data science workflows. //Overview of accelerators include::modules/overview-of-accelerators.adoc[leveloffset=+1] @@ -32,6 +32,14 @@ include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+1] //include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+2] +//AMD GPUs + +include::modules/amd-gpu-integration.adoc[leveloffset=+1] + +include::modules/verifying-amd-gpu-availability-on-your-cluster.adoc[leveloffset=+2] + +include::modules/enabling-amd-gpus.adoc[leveloffset=+2] + //Using accelerator profiles include::modules/working-with-accelerator-profiles.adoc[leveloffset=+1] @@ -46,6 +54,3 @@ include::modules/viewing-accelerator-profiles.adoc[leveloffset=+2] include::modules/configuring-a-recommended-accelerator-for-notebook-images.adoc[leveloffset=+2] include::modules/configuring-a-recommended-accelerator-for-serving-runtimes.adoc[leveloffset=+2] - - -
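To complement the per-node checks in the verification module added above, the following command (not part of this patch) reports the allocatable `amd.com/gpu` count for every node in one pass. The escaped dot in the resource name is required by the custom-columns parser; nodes without AMD GPUs show `<none>`.

[source,terminal]
----
# Show the node name and allocatable AMD GPU count for all nodes
oc get nodes "-o=custom-columns=NAME:.metadata.name,AMD_GPU:.status.allocatable.amd\.com/gpu"
----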