[DRAFT] - ENG-4292 - Initial changes for ODH AMD GPU docs #561

chtyler · 2024-11-25T18:34:06Z

Description

These changes document the general availability (GA) of AMD GPUs at 2.16.1 for ODH. AMD has provided all of the in-depth enablement instructions for deploying the AMD GPU Operator on RHOAI, so we link out to the AMD instructions wherever possible. In some areas, we needed to add our own content.

How Has This Been Tested?

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

aduquett

Minor comments, looks great overall!

aduquett · 2025-01-20T15:33:31Z

modules/about-base-training-images.adoc

+ifndef::upstream[]
+| Ray ROCm
+endif::[]
+ifdef::upstream[]
 | Ray ROCm 
 | If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack. 
 Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs. 


Is this missing an endif::[] here?
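
A question like this can be answered mechanically by counting block-opening conditionals against `endif::[]` directives in the module. The following is a minimal sketch and not part of the PR: the function name `check_conditionals` is hypothetical, and it assumes that block-form directives start at the beginning of a line and end in `[]`. Single-line forms such as `ifdef::attr[text]` are skipped because they need no `endif`.

```shell
# Hypothetical helper: compare the number of block-opening conditionals
# with the number of endif directives in one AsciiDoc file.
# Prints "balanced" or "unbalanced: <n> open, <m> closed".
check_conditionals() {
  awk '
    # ifeval::[...] always opens a block.
    /^ifeval::/ { opened++ }
    # ifdef/ifndef open a block only in the empty-bracket form.
    /^(ifdef|ifndef)::.*\[\][ \t]*$/ { opened++ }
    /^endif::/ { closed++ }
    END {
      if (opened == closed) print "balanced"
      else printf "unbalanced: %d open, %d closed\n", opened, closed
    }
  ' "$1"
}

# Usage (path taken from this review thread):
#   check_conditionals modules/about-base-training-images.adoc
```

A count-based check does not catch mismatched nesting, but it is enough to flag a dropped `endif::[]`.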

aduquett · 2025-01-20T15:50:38Z

modules/amd-gpu-integration.adoc

+
+You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.
+
+Integrating AMD GPUs with {productname-short} involves the following components and benefits:


Suggestion (from comment on line 16):
Integrating AMD GPUs with {productname-short} involves the following components:

aduquett · 2025-01-20T15:53:21Z

modules/amd-gpu-integration.adoc

+* **AMD GPU Operator**: 
+  The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.
+
+* **Why Use AMD GPUs with {productname-short}?**


This third bullet doesn't really flow with the first two. Suggestion:
Integrating AMD GPUs with {productname-short} provides the following benefits:

I'd delete this whole section, the intro paragraph ("AMD GPUs provide high-performance compute capabilities, [...]") is imo sufficient.

aduquett · 2025-01-20T15:58:53Z

modules/enabling-amd-gpus.adoc

+* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
+endif::[]
+ifdef::cloud-service[]
+* You have logged in to OpenShift.
+* You have the `cluster-admin` role in OpenShift.
+* You have installed your AMD GPU and confirmed that it is detected in your environment.
+* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).


Could use {openshift-platform} instead of OpenShift

aduquett · 2025-01-20T16:01:04Z

modules/enabling-amd-gpus.adoc

+endif::[]
+
+.Procedure
+. Install version 1.1.1 of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].


Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.

aduquett · 2025-01-20T16:08:12Z

modules/overview-of-accelerators.adoc

+** To enable NVIDIA GPUs on OpenShift, you must install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator].
+* AMD graphics processing units (GPUs)
+** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
+** To enable AMD GPUs on OpenShift, you must:


** To enable AMD GPUs on {openshift-platform}, you must do the following tasks:

aduquett · 2025-01-20T16:13:09Z

modules/verifying-amd-gpu-availability-on-your-cluster.adoc

+* You have a running OpenShift cluster with a node equipped with an AMD GPU.
+* You have access to the OpenShift CLI (`oc`) and terminal access to the node.
+
+.Procedure
+. Use the OpenShift CLI to verify if GPU resources are allocatable:


Could use {openshift-platform} instead of OpenShift

aduquett · 2025-01-20T16:16:17Z

modules/verifying-amd-gpu-availability-on-your-cluster.adoc

+----
+oc describe node <node_name>
+----
+.. In the output, locate the **Capacity** and **Allocatable** sections and confirm that `amd.com/gpu` is listed. Example output:


Suggestion: Use "For example:" instead of "Example output:" for consistency.
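
The manual check in this step (find `amd.com/gpu` under both **Capacity** and **Allocatable**) can also be scripted. The following is a minimal sketch, not part of the PR: the function name `has_amd_gpu_resource` is hypothetical, and it assumes the usual `oc describe node` layout in which section headings such as `Capacity:` start at column 0 and their entries are indented.

```shell
# Hypothetical helper: read `oc describe node` output on stdin and succeed
# only if amd.com/gpu appears under both Capacity and Allocatable.
has_amd_gpu_resource() {
  awk '
    /^Capacity:/    { section = "cap";   next }
    /^Allocatable:/ { section = "alloc"; next }
    # Any other unindented line starts a different section.
    /^[^ \t]/       { section = "" }
    section == "cap"   && /amd\.com\/gpu/ { cap = 1 }
    section == "alloc" && /amd\.com\/gpu/ { alloc = 1 }
    END { exit !(cap && alloc) }
  '
}

# Usage:
#   oc describe node <node_name> | has_amd_gpu_resource && echo "amd.com/gpu is allocatable"
```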

aduquett · 2025-01-20T16:16:52Z

modules/verifying-amd-gpu-availability-on-your-cluster.adoc

+----
+lspci | grep -E "MI210|MI250|MI300"
+----
+.. Verify that the output includes one of the AMD GPU models. Example:


Suggestion: Use "For example:" instead of "Example:".
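
If the procedure ever needs a count instead of a visual check, the same `grep` pattern can be wrapped in a small function. This is a sketch only: the function name `count_amd_gpus` is hypothetical, and the model list is copied from the command in the hunk above.

```shell
# Hypothetical helper: read lspci output on stdin and print how many lines
# mention one of the AMD GPU models named in the procedure.
count_amd_gpus() {
  # grep -c exits non-zero when the count is 0; keep the function's status at 0.
  grep -cE 'MI210|MI250|MI300' || true
}

# Usage on a node:
#   lspci | count_amd_gpus
```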

aduquett · 2025-01-20T16:19:43Z

modules/verifying-amd-gpu-availability-on-your-cluster.adoc

+
+.Verification
+* The `oc describe node <node_name>` command lists `amd.com/gpu` under **Capacity** and **Allocatable**.
+* The `lspci` command output identifies an AMD GPU as a PCI device matching one of the specified models (e.g., MI210, MI250, MI300).


Per the style guide, do not use "e.g."; replace it with:
(for example, MI210, MI250, MI300)

aduquett · 2025-01-20T16:23:24Z

modules/overview-of-accelerators.adoc

+* AMD graphics processing units (GPUs)
+** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
+** To enable AMD GPUs on OpenShift, you must:
+*** Install version 1.1.1 of the AMD GPU Operator.


Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.

jiridanek · 2025-01-20T16:31:51Z

modules/overview-of-accelerators.adoc

+* AMD graphics processing units (GPUs)
+** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
+** To enable AMD GPUs on OpenShift, you must:
+*** Install version 1.1.1 of the AMD GPU Operator.


hardcoded operator version?
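
To see how many places a later version bump would touch, the modules can be searched for the hardcoded string. The following is a minimal sketch, not part of the PR: the function name `find_hardcoded_versions` and the default `modules` directory are illustrative, and the pattern assumes the wording "version X.Y.Z of the AMD GPU Operator" used in the hunks above.

```shell
# Hypothetical helper: list every line under a directory that hardcodes an
# AMD GPU Operator version, printed as path:line-number:text.
find_hardcoded_versions() {
  grep -rnE 'version [0-9]+(\.[0-9]+)+ of the AMD GPU Operator' "${1:-modules}"
}

# Usage from the repository root:
#   find_hardcoded_versions modules
```

Each hit is a candidate for either dropping the version or replacing it with a shared attribute, as suggested in the comments above.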

chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 06044ee to 7355f4b Compare January 9, 2025 13:28

chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 7355f4b to 844091f Compare January 17, 2025 15:09

chtyler added 3 commits January 20, 2025 15:23

rebasing

901211d

Removing modules that are no longer needed due to AMD documentation c…

16bc432

…overing these

final changes to AMD PR before peer review, improved the wording in s…

bd012fa

…everal modules and correct some formatting issues and content errors

chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 5264ee9 to bd012fa Compare January 20, 2025 15:23

chtyler changed the title ~~DRAFT - Do not merge: ENG-4292 - Initial changes for ODH AMD GPU docs~~ DRAFT - ENG-4292 - Initial changes for ODH AMD GPU docs Jan 20, 2025

aduquett reviewed Jan 20, 2025

jiridanek reviewed Jan 20, 2025

chtyler marked this pull request as draft January 20, 2025 16:35

chtyler changed the title ~~DRAFT - ENG-4292 - Initial changes for ODH AMD GPU docs~~ [DRAFT] - ENG-4292 - Initial changes for ODH AMD GPU docs Jan 20, 2025
