ENG-4392 - Initial changes for ODH AMD GPU docs #561

Merged: 4 commits merged on Jan 31, 2025. Showing changes from all commits.
1 change: 1 addition & 0 deletions managing-odh.adoc
@@ -44,6 +44,7 @@ include::assemblies/customizing-component-deployment-resources.adoc[leveloffset=
//add intro text?
include::modules/enabling-nvidia-gpus.adoc[leveloffset=+2]
include::modules/intel-gaudi-ai-accelerator-integration.adoc[leveloffset=+2]
include::modules/amd-gpu-integration.adoc[leveloffset=+2]
//include::modules/enabling-intel-gaudi-ai-accelerators.adoc[leveloffset=+3]

== Managing distributed workloads
2 changes: 1 addition & 1 deletion modules/about-base-training-images.adoc
@@ -54,4 +54,4 @@ endif::[]
ifndef::upstream[]
* Create a custom image that includes the additional libraries or packages.
For more information, see link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#creating-a-custom-training-image_distributed-workloads[Creating a custom training image].
endif::[]
endif::[]
11 changes: 10 additions & 1 deletion modules/about-workbench-images.adoc
@@ -100,4 +100,13 @@ To use the *CUDA - RStudio Server* workbench image, you must first build it by c
The *CUDA - RStudio Server* workbench image contains NVIDIA CUDA technology. CUDA licensing information is available at link:https://docs.nvidia.com/cuda/[https://docs.nvidia.com/cuda/]. Review the licensing terms before you use this sample workbench.
====
endif::[]
|===

| ROCm
| Use the ROCm notebook image to run AI and machine learning workloads on AMD GPUs in {productname-short}. It includes ROCm libraries and tools optimized for high-performance GPU acceleration, supporting custom AI workflows and data processing tasks. Use this image when integrating additional frameworks or dependencies tailored to your specific AI development needs.

| ROCm-PyTorch
| Use the ROCm-PyTorch notebook image to optimize PyTorch workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated PyTorch libraries, enabling efficient deep learning training, inference, and experimentation. This image is designed for data scientists working with PyTorch-based workflows, offering integration with GPU scheduling.

| ROCm-TensorFlow
| Use the ROCm-TensorFlow notebook image to optimize TensorFlow workloads on AMD GPUs in {productname-short}. It includes ROCm-accelerated TensorFlow libraries to support high-performance deep learning model training and inference. This image simplifies TensorFlow development on AMD GPUs and integrates with {productname-short} for resource scaling and management.
|===
@@ -9,13 +9,13 @@ When you have enabled the multi-model serving platform, you must configure a mod
ifdef::self-managed[]
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for model serving.
In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving.
====
endif::[]
ifdef::cloud-service[]
[NOTE]
====
In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for model serving.
In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for model serving.
====
endif::[]

@@ -31,11 +31,11 @@ endif::[]
* You have enabled the multi-model serving platform.
ifndef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{rhoaidocshome}{default-format-url}/serving_models/serving-small-and-medium-sized-models_model-serving#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
endif::[]
ifdef::upstream[]
* If you want to use a custom model-serving runtime for your model server, you have added and enabled the runtime. See link:{odhdocshome}/serving-models/#adding-a-custom-model-serving-runtime-for-the-multi-model-serving-platform_model-serving[Adding a custom model-serving runtime].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery Operator and either the NVIDIA GPU Operator or the AMD GPU Operator. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
endif::[]

.Procedure
@@ -81,4 +81,4 @@ If you are using a _custom_ model-serving runtime with your model server and wan
. Optional: To update the model server, click the action menu (*⋮*) beside the model server and select *Edit model server*.

//[role="_additional-resources"]
//.Additional resources
//.Additional resources
14 changes: 14 additions & 0 deletions modules/amd-gpu-integration.adoc
@@ -0,0 +1,14 @@
:_module-type: CONCEPT

[id='amd-gpu-integration_{context}']
= AMD GPU integration

You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with {productname-short} involves the following components:

* **ROCm workbench images**:
Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation.

* **AMD GPU Operator**:
The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.
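
After the AMD GPU Operator and its device plugin are running, workloads request AMD GPUs through the `amd.com/gpu` resource name. The following pod specification is a minimal sketch; the image and command are placeholders, so substitute a ROCm-enabled image and entry point that match your workload.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: rocm-workload
    # Placeholder image: use any ROCm-enabled image available to your cluster.
    image: quay.io/example/rocm-workload:latest
    # Placeholder command: lists the GPUs visible inside the container.
    command: ["rocm-smi"]
    resources:
      limits:
        amd.com/gpu: 1 # Requests one AMD GPU exposed by the device plugin.
----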
@@ -70,28 +70,27 @@

ifndef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support in {productname-short}.
See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
+
ifdef::self-managed[]
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads.
In {productname-short} {vernum}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads.
====
endif::[]
ifdef::cloud-service[]
[NOTE]
====
In {productname-short}, {org-name} supports only NVIDIA GPU accelerators for distributed workloads.
In {productname-short}, {org-name} supports only NVIDIA and AMD GPU accelerators for distributed workloads.
====
endif::[]
endif::[]
ifdef::upstream[]
* If you want to use graphics processing units (GPUs), you have enabled GPU support.
This process includes installing the Node Feature Discovery Operator and the NVIDIA GPU Operator.
This process includes installing the Node Feature Discovery Operator and either the NVIDIA GPU Operator or the AMD GPU Operator.
For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
endif::[]


.Procedure

. In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
@@ -132,15 +131,15 @@ metadata:
spec:
namespaceSelector: {} # match all.
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"] # If you use AMD GPUs, replace "nvidia.com/gpu" with "amd.com/gpu"
flavors:
- name: "default-flavor"
resources:
- name: "cpu"
nominalQuota: 9
- name: "memory"
nominalQuota: 36Gi
- name: "nvidia.com/gpu"
- name: "nvidia.com/gpu" # If you use AMD GPUs, replace "nvidia.com/gpu" with "amd.com/gpu"
nominalQuota: 5
----
+
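If your nodes expose AMD GPUs instead, use the `amd.com/gpu` resource name in both places. The following fragment is a sketch of the same resource group with the identical, illustrative quota values:
+
[source,yaml]
----
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "amd.com/gpu"
        nominalQuota: 5
----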
@@ -200,4 +199,4 @@ $ oc get -n __<project-name>__ localqueues

[role='_additional-resources']
.Additional resources
* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation]
* link:https://kueue.sigs.k8s.io/docs/concepts/[Kueue documentation]
5 changes: 2 additions & 3 deletions modules/creating-a-custom-training-image.adoc
@@ -63,7 +63,7 @@ FROM quay.io/modh/ray:2.35.0-py39-rocm61
FROM quay.io/modh/ray:2.35.0-py311-rocm61
----

* To create a CUDA-compatible KFTO cluster image, specify the CUDA-compatible KFTO base image location:
* To create a CUDA-compatible KFTO cluster image, specify the Developer Preview CUDA-compatible KFTO base image location:
+
.CUDA-compatible KFTO base image with Python 3.11
[source,bash]
@@ -126,5 +126,4 @@ If your new image was created successfully, it is included in the list of images
podman push ${IMG}:0.0.1
----

. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry].

. Optional: Make your new image available to other users, as described in link:{rhoaidocshome}{default-format-url}/working_with_distributed_workloads/managing-custom-training-images_distributed-workloads#pushing-an-image-to-the-integrated-openshift-image-registry_distributed-workloads[Pushing an image to the integrated OpenShift image registry].
@@ -23,7 +23,7 @@ Deploying a machine learning model using KServe raw deployment mode on single no
* For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
* To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.
ifndef::upstream[]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs].
endif::[]
ifdef::upstream[]
@@ -274,4 +274,4 @@ Ensure that you replace `<namespace>`, `<pod-name>`, `<local_port>`, `<remote_po
* Use your preferred client library or tool to send requests to the `localhost` inference URL.

// [role="_additional-resources"]
// .Additional resources
// .Additional resources
@@ -43,8 +43,8 @@ endif::[]
* For the model that you want to deploy, you know the associated folder path in your S3-compatible object storage bucket.
* To use the Caikit-TGIS runtime, you have converted your model to Caikit format. For an example, see link:https://github.com/opendatahub-io/caikit-tgis-serving/blob/main/demo/kserve/built-tip.md#bootstrap-process[Converting Hugging Face Hub models to Caikit format^] in the link:https://github.com/opendatahub-io/caikit-tgis-serving/tree/main[caikit-tgis-serving^] repository.
ifndef::upstream[]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. See link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^].
* To use the *vLLM ServingRuntime for KServe* runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]
* If you want to use graphics processing units (GPUs) with your model server, you have enabled GPU support in {productname-short}. If you use NVIDIA GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]. If you use AMD GPUs, see link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#amd-gpu-integration_managing-rhoai[AMD GPU integration^].
* To use the vLLM runtime, you have enabled GPU support in {productname-short} and have installed and configured the Node Feature Discovery operator on your cluster. For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator#installing-the-node-feature-discovery-operator_psap-node-feature-discovery-operator[Installing the Node Feature Discovery operator] and link:{rhoaidocshome}{default-format-url}/managing_openshift_ai/enabling_accelerators#enabling-nvidia-gpus_managing-rhoai[Enabling NVIDIA GPUs^]
endif::[]
ifdef::upstream[]
* To use the *vLLM ServingRuntime for KServe* runtime or use graphics processing units (GPUs) with your model server, you have enabled GPU support. This includes installing the Node Feature Discovery and NVIDIA GPU Operators. For more information, see https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator on {org-name} OpenShift Container Platform^] in the NVIDIA documentation.
@@ -65,14 +65,14 @@ ifdef::self-managed[]
+
[NOTE]
====
In {productname-short} {vernum}, {org-name} supports NVIDIA GPU, AMD GPU and Intel Gaudi accelerators for model serving.
In {productname-short} {vernum}, {org-name} supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
====
endif::[]
ifdef::cloud-service[]
+
[NOTE]
====
In {productname-short}, {org-name} supports NVIDIA GPU, AMD GPU and Intel Gaudi accelerators for model serving.
In {productname-short}, {org-name} supports NVIDIA GPU, Intel Gaudi, and AMD GPU accelerators for model serving.
====
endif::[]
* To deploy {rhelai-productname-short} models:
@@ -146,4 +146,4 @@ NOTE: Do not modify the port or model serving runtime arguments, because they re
* Confirm that the deployed model is shown on the *Models* tab for the project, and on the *Model Serving* page of the dashboard with a checkmark in the *Status* column.

// [role="_additional-resources"]
// .Additional resources
// .Additional resources
52 changes: 52 additions & 0 deletions modules/enabling-amd-gpus.adoc
@@ -0,0 +1,52 @@
:_module-type: PROCEDURE

[id='enabling-amd-gpus_{context}']
= Enabling AMD GPUs

[role='_abstract']
Before you can use AMD GPUs in {productname-short}, you must install the required dependencies, deploy the AMD GPU Operator, and configure the environment.

.Prerequisites
ifdef::upstream,self-managed[]
* You have logged in to {openshift-platform}.
* You have the `cluster-admin` role in {openshift-platform}.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your {openshift-platform} environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]
ifdef::cloud-service[]
* You have logged in to OpenShift.
* You have the `cluster-admin` role in OpenShift.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your {openshift-platform} environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]

.Procedure
. Install the latest version of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].
. After installing the AMD GPU Operator, configure the AMD drivers that the Operator requires, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/drivers/installation.html[Configure AMD drivers for the GPU Operator].
+
[NOTE]
====
Alternatively, you can install the AMD GPU Operator from the {org-name} Catalog. For more information, see link:https://catalog.redhat.com/software/container-stacks/detail/6722781e65e61b6d4caccef8?rh-tabs-2b5yslu8z=rh-tab-v8le4ijlp[Install AMD GPU Operator from Red Hat Catalog].
====
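+
The driver configuration described in the preceding step is typically expressed as a `DeviceConfig` custom resource that tells the Operator how to manage the driver and device plugin. The following sketch is illustrative only: the API version, namespace, and field names are assumptions, so verify them against the AMD GPU Operator documentation linked above for your Operator version.
+
[source,yaml]
----
apiVersion: amd.com/v1alpha1 # Assumed API group and version; confirm in the AMD GPU Operator documentation.
kind: DeviceConfig
metadata:
  name: amd-gpu-deviceconfig
  namespace: kube-amd-gpu # Assumed Operator namespace; use the namespace from your installation.
spec:
  driver:
    enable: true # Let the Operator build and load the AMD GPU driver.
  devicePlugin: {} # Use the default device plugin settings.
  selector:
    feature.node.kubernetes.io/amd-gpu: "true" # Assumed NFD label that targets nodes with AMD GPUs.
----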

//downstream - all
ifndef::upstream[]
. After installing the AMD GPU Operator, create an accelerator profile, as described in link:{rhoaidocshome}{default-format-url}/working_with_accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
endif::[]

//upstream only
ifdef::upstream[]
. After installing the AMD GPU Operator, create an accelerator profile, as described in link:{odhdocshome}/working-with-accelerators/#working-with-accelerator-profiles_accelerators[Working with accelerator profiles].
endif::[]
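
For reference, an accelerator profile for AMD GPUs uses `amd.com/gpu` as the resource identifier. The following is a minimal sketch; the name and namespace are placeholders, so verify the fields against the accelerator profiles documentation linked above.

[source,yaml]
----
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: amd-gpu # Placeholder name.
  namespace: opendatahub # Placeholder: use the applications namespace of your installation.
spec:
  displayName: AMD GPU
  enabled: true
  identifier: amd.com/gpu # Resource name exposed by the AMD GPU device plugin.
  tolerations: [] # Add tolerations if your GPU nodes are tainted.
----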

.Verification
From the *Administrator* perspective, go to the *Operators* -> *Installed Operators* page. Confirm that the following Operators appear:

* AMD GPU Operator
* Node Feature Discovery (NFD)
* Kernel Module Management (KMM)
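
Optionally, confirm that your GPU nodes advertise the `amd.com/gpu` resource. On a correctly configured node, the allocatable resources in the node description (for example, in the output of `oc get node <node-name> -o yaml`) include an entry similar to the following illustrative fragment:

[source,yaml]
----
status:
  allocatable:
    amd.com/gpu: "1" # One schedulable AMD GPU; the count varies by node hardware.
----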

[NOTE]
====
Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration might prevent the AMD GPUs from being recognized or functioning properly.
====