ENG-4392 - Initial changes for ODH AMD GPU docs #561

Merged: 4 commits merged on Jan 31, 2025. Showing changes from all commits.
1 change: 1 addition & 0 deletions managing-odh.adoc
@@ -44,6 +44,7 @@ include::assemblies/customizing-component-deployment-resources.adoc[leveloffset=
//add intro text?
include::modules/enabling-nvidia-gpus.adoc[leveloffset=+2]
include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+2]
include::modules/amd-gpu-integration.adoc[leveloffset=+2]
//include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+3]

== Managing distributed workloads
2 changes: 1 addition & 1 deletion modules/about-base-training-images.adoc
@@ -54,4 +54,4 @@ endif::[]
ifndef::upstream[]
* Create a custom image that includes the additional libraries or packages.
For more information, see link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#creating-a-custom-training-image_distributed-workloads[Creating a custom training image].
endif::[]
endif::[]
11 changes: 10 additions & 1 deletion modules/about-workbench-images.adoc
@@ -100,4 +100,13 @@ To use the *CUDA - RStudio Server* workbench image, you must first build it by c
The *CUDA - RStudio Server* workbench image contains NVIDIA CUDA technology. CUDA licensing information is available at link:https://docs.nvidia.com/cuda/[https://docs.nvidia.com/cuda/]. Review the licensing terms before you use this sample workbench.
====
endif::[]
|===

| ROCm
| Use the ROCm notebook image to run AI and machine learning workloads on AMD GPUs in {productname-short}. It includes ROCm libraries and tools optimized for high-performance GPU acceleration, supporting custom AI workflows and data processing tasks. Use this image when integrating additional frameworks or dependencies tailored to your specific AI development needs.

| ROCm-PyTorch
| Use the ROCm-PyTorch notebook image to optimize PyTorch workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated PyTorch libraries, enabling efficient deep learning training, inference, and experimentation. This image is designed for data scientists working with PyTorch-based workflows, offering integration with GPU scheduling.

| ROCm-TensorFlow
| Use the ROCm-TensorFlow notebook image to optimize TensorFlow workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated TensorFlow libraries to support high-performance deep learning model training and inference. This image simplifies TensorFlow development on AMD GPUs and integrates with {productname-short} for resource scaling and management.
|===
@@ -9,13 +9,13 @@ When you have enabled the multi-model serving platform, you must configure a mod
ifdef::self-managed[]
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving.
In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving.
====
endif::[]
ifdef::cloud-service[]
[NOTE]
====
In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for model serving.
In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving.
====
endif::[]

@@ -31,11 +31,11 @@ endif::[]
* You have enabled the multi-model serving platform.
ifndef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{rhoaidocshome}{default-format-url}/serving_models/serving-small-and-medium-sized-models_model-serving#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
endif::[]
ifdef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery Operator and either the NVIDIA GPU Operator or the AMD GPU Operator. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
endif::[]

.Procedure
@@ -81,4 +81,4 @@ If you are using a _custom_ model-serving runtime with your model server and wan
. Optional: To update the model server, click the action menu (*⋮*) beside the model server and select *Edit model server*.

//[role="_additional-resources"]
//.Additional resources
//.Additional resources
14 changes: 14 additions & 0 deletions modules/amd-gpu-integration.adoc
@@ -0,0 +1,14 @@
:_module-type: CONCEPT

[id='amd-gpu-integration_{context}']
= AMD GPU integration

You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with {productname-short} involves the following components:

* **ROCm workbench images**:
Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation.

* **AMD GPU Operator**:
The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.
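
After the AMD GPU Operator and its device plugin are running, workloads request AMD GPUs through the `amd.com/gpu` resource name. The following pod specification is a minimal sketch; the image and command are placeholders, so substitute a ROCm-enabled image and entry point that match your workload.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: rocm-workload
    # Placeholder image: use any ROCm-enabled image available to your cluster.
    image: quay.io/example/rocm-workload:latest
    # Placeholder command: lists the GPUs visible inside the container.
    command: ["rocm-smi"]
    resources:
      limits:
        amd.com/gpu: 1 # Requests one AMD GPU exposed by the device plugin.
----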
@@ -70,28 +70,27 @@

ifndef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}.
See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
+
ifdef::self-managed[]
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads.
In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads.
====
endif::[]
ifdef::cloud-service[]
[NOTE]
====
In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads.
In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads.
====
endif::[]
endif::[]
ifdef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support.
This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator.
This process includes installing the Node Feature Discovery Operator and either the NVIDIA GPU Operator or the AMD GPU Operator.
For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
endif::[]


.Procedure

. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
@@ -132,15 +131,15 @@ metadata:
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"] # If you use AMD GPUs, replace "nvidia.com/gpu" with "amd.com/gpu"
flavors:
- name: "default-flavor"
resources:
- name: "cpu"
nominalQuota: 9
- name: "memory"
nominalQuota: 36Gi
- name: "nvidia.com/gpu"
- name: "nvidia.com/gpu" # If you use AMD GPUs, replace "nvidia.com/gpu" with "amd.com/gpu"
nominalQuota: 5
----
+
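If your nodes expose AMD GPUs instead, use the `amd.com/gpu` resource name in both places. The following fragment is a sketch of the same resource group with the identical, illustrative quota values:
+
[source,yaml]
----
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "amd.com/gpu"
        nominalQuota: 5
----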
@@ -200,4 +199,4 @@ $ oc get -n __<project-name>__ localqueues

[role='_additional-resources']
.Additional resources
* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation]
* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation]
5 changes: 2 additions & 3 deletions modules/creating-a-custom-training-image.adoc
@@ -63,7 +63,7 @@ FROM quay.io/modh/ray:2.35.0-py39-rocm61
FROM quay.io/modh/ray:2.35.0-py311-rocm61
----

* To create a CUDA-compatible KFTO cluster image, specify the CUDA-compatible KFTO base image location:
* To create a CUDA-compatible KFTO cluster image, specify the Developer Preview CUDA-compatible KFTO base image location:
+
.CUDA-compatible KFTO base image with Python 3.11
[source,bash]
@@ -126,5 +126,4 @@ If your new image was created successfully, it is included in the list of images
podman push ${IMG}:0.0.1
----

. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry].

. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry].
@@ -23,7 +23,7 @@ Deploying a machine learning model using KServe raw deployment mode on single no
* For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
* To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.
ifndef::upstream[]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
endif::[]
ifdef::upstream[]
@@ -274,4 +274,4 @@ Ensure that you replace `<namespace>`, `<pod-name>`, `<local_port>`, `<remote_po
* Use your preferred client library or tool to send requests to the `localhost` inference URL.

// [role="_additional-resources"]
// .Additional resources
// .Additional resources
@@ -43,8 +43,8 @@ endif::[]
* For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
* To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.
ifndef::upstream[]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^].
* To use the *vLLM ServingRuntime for KServe* runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]
endif::[]
ifdef::upstream[]
* To use the *vLLM ServingRuntime for KServe* runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
@@ -65,14 +65,14 @@ ifdef::self-managed[]
+
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports NVIDIA GPU, AMD GPU and Intel Gaudi accelerators for model serving.
In {productname-short} {vernum}, {org-name} supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
====
endif::[]
ifdef::cloud-service[]
+
[NOTE]
====
In {productname-short}, {org-name} supports NVIDIA GPU, AMD GPU and Intel Gaudi accelerators for model serving.
In {productname-short}, {org-name} supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
====
endif::[]
* To deploy {rhelai-productname-short} models:
@@ -146,4 +146,4 @@ NOTE: Do not modify the port or model serving runtime arguments, because they re
* Confirm that the deployed model is shown on the *Models* tab for the project, and on the *Model Serving* page of the dashboard with a checkmark in the *Status* column.

// [role="_additional-resources"]
// .Additional resources
// .Additional resources
52 changes: 52 additions & 0 deletions modules/enabling-amd-gpus.adoc
@@ -0,0 +1,52 @@
:_module-type: PROCEDURE

[id='enabling-amd-gpus_{context}']
= Enabling AMD GPUs

[role='_abstract']
Before you can use AMD GPUs in {productname-short}, you must install the required dependencies, deploy the AMD GPU Operator, and configure the environment.

.Prerequisites
ifdef::upstream,self-managed[]
* You have logged in to {openshift-platform}.
* You have the `cluster-admin` role in {openshift-platform}.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your {openshift-platform} environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]
ifdef::cloud-service[]
* You have logged in to OpenShift.
* You have the `cluster-admin` role in OpenShift.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your {openshift-platform} environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]

.Procedure
. Install the latest version of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].
. After installing the AMD GPU Operator, configure the AMD drivers that the Operator requires, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/drivers/installation.html[Configure AMD drivers for the GPU Operator].
+
[NOTE]
====
Alternatively, you can install the AMD GPU Operator from the {org-name} Catalog. For more information, see link:https://catalog.redhat.com/software/container-stacks/detail/6722781e65e61b6d4caccef8?rh-tabs-2b5yslu8z=rh-tab-v8le4ijlp[Install AMD GPU Operator from Red Hat Catalog].
====
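+
The driver configuration described in the preceding step is typically expressed as a `DeviceConfig` custom resource that tells the Operator how to manage the driver and device plugin. The following sketch is illustrative only: the API version, namespace, and field names are assumptions, so verify them against the AMD GPU Operator documentation linked above for your Operator version.
+
[source,yaml]
----
apiVersion: amd.com/v1alpha1 # Assumed API group and version; confirm in the AMD GPU Operator documentation.
kind: DeviceConfig
metadata:
  name: amd-gpu-deviceconfig
  namespace: kube-amd-gpu # Assumed Operator namespace; use the namespace from your installation.
spec:
  driver:
    enable: true # Let the Operator build and load the AMD GPU driver.
  devicePlugin: {} # Use the default device plugin settings.
  selector:
    feature.node.kubernetes.io/amd-gpu: "true" # Assumed NFD label that targets nodes with AMD GPUs.
----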

//downstream - all
ifndef::upstream[]
. After installing the AMD GPU Operator, create an accelerator profile, as described in link:{rhoaidocshome}{default-format-url}/working_with_accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
endif::[]

//upstream only
ifdef::upstream[]
. After installing the AMD GPU Operator, create an accelerator profile, as described in link:{odhdocshome}/working-with-accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
endif::[]
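
For reference, an accelerator profile for AMD GPUs uses `amd.com/gpu` as the resource identifier. The following is a minimal sketch; the name and namespace are placeholders, so verify the fields against the accelerator profiles documentation linked above.

[source,yaml]
----
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: amd-gpu # Placeholder name.
  namespace: opendatahub # Placeholder: use the applications namespace of your installation.
spec:
  displayName: AMD GPU
  enabled: true
  identifier: amd.com/gpu # Resource name exposed by the AMD GPU device plugin.
  tolerations: [] # Add tolerations if your GPU nodes are tainted.
----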

.Verification
From the *Administrator* perspective, go to the *Operators* -> *Installed Operators* page. Confirm that the following Operators appear:

* AMD GPU Operator
* Node Feature Discovery (NFD)
* Kernel Module Management (KMM)
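
Optionally, confirm that your GPU nodes advertise the `amd.com/gpu` resource. On a correctly configured node, the allocatable resources in the node description (for example, in the output of `oc get node <node-name> -o yaml`) include an entry similar to the following illustrative fragment:

[source,yaml]
----
status:
  allocatable:
    amd.com/gpu: "1" # One schedulable AMD GPU; the count varies by node hardware.
----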

[NOTE]
====
Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration might prevent the AMD GPUs from being recognized or functioning properly.
====