[DRAFT] - ENG-4292 - Initial changes for ODH AMD GPU docs #561

Draft
wants to merge 3 commits into
base: main

Contributor

@chtyler chtyler commented Nov 25, 2024

Description

These changes document the general availability (GA) of AMD GPU support in ODH at 2.16.1. AMD has provided the in-depth enablement instructions for deploying the AMD GPU Operator on RHOAI, so we link out to the AMD instructions wherever possible. In some areas, we needed to add our own content.

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@chtyler chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 06044ee to 7355f4b on January 9, 2025 13:28
@chtyler chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 7355f4b to 844091f on January 17, 2025 15:09
@chtyler chtyler force-pushed the ENG-4392-amd-gpus-docs branch from 5264ee9 to bd012fa on January 20, 2025 15:23
@chtyler chtyler changed the title DRAFT - Do not merge: ENG-4292 - Initial changes for ODH AMD GPU docs DRAFT - ENG-4292 - Initial changes for ODH AMD GPU docs Jan 20, 2025
Contributor

@aduquett aduquett left a comment

Minor comments, looks great overall!

ifndef::upstream[]
| Ray ROCm
endif::[]
ifdef::upstream[]
| Ray ROCm
| If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack.
Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.
Contributor

Is this missing an endif::[] here?
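
If it helps to confirm this beyond the visible hunk, a minimal sketch of a balance check follows. It assumes block-form directives only (`ifdef::attr[]`, `ifndef::attr[]`, `ifeval::[...]`, each closed by `endif::`); the function name `unclosed_conditionals` is made up for illustration and is not part of any AsciiDoc tooling.

```shell
# Hypothetical helper: reads AsciiDoc on stdin and prints how many
# conditional blocks are still open at the end of the input.
unclosed_conditionals() {
  awk '
    /^(ifdef|ifndef)::[^[]*\[\]/ { depth++ }  # block form: empty brackets
    /^ifeval::\[/                { depth++ }
    /^endif::/                   { depth-- }
    END { print depth + 0 }
  '
}
```

A result of `0` means every opener has a matching `endif::`. For the lines quoted above the result would be `1`, unless the closing `endif::[]` sits just outside the diff context.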


You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with {productname-short} involves the following components and benefits:
Contributor

@aduquett aduquett Jan 20, 2025

Suggestion (from comment on line 16):
Integrating AMD GPUs with {productname-short} involves the following components:

Contributor Author

Fixed.

* **AMD GPU Operator**:
The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.

* **Why Use AMD GPUs with {productname-short}?**
Contributor

This third bullet doesn't really flow with the first two. Suggestion:
Integrating AMD GPUs with {productname-short} provides the following benefits:

Member

I'd delete this whole section, the intro paragraph ("AMD GPUs provide high-performance compute capabilities, [...]") is imo sufficient.

Comment on lines +14 to +20
* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]
ifdef::cloud-service[]
* You have logged in to OpenShift.
* You have the `cluster-admin` role in OpenShift.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
Contributor

Could use {openshift-platform} instead of OpenShift

Contributor Author

Changed.

endif::[]

.Procedure
. Install version 1.1.1 of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].
Contributor

Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.

** To enable NVIDIA GPUs on OpenShift, you must install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator].
* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
Contributor

** To enable AMD GPUs on {openshift-platform}, you must do the following tasks:

Comment on lines +11 to +15
* You have a running OpenShift cluster with a node equipped with an AMD GPU.
* You have access to the OpenShift CLI (`oc`) and terminal access to the node.

.Procedure
. Use the OpenShift CLI to verify if GPU resources are allocatable:
Contributor

Could use {openshift-platform} instead of OpenShift

----
oc describe node <node_name>
----
.. In the output, locate the **Capacity** and **Allocatable** sections and confirm that `amd.com/gpu` is listed. Example output:
Contributor

Suggestion: Use "For example:" instead of "Example output:" for consistency.
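
For a scripted version of this check, here is a rough sketch that reads `oc describe node <node_name>` output on stdin and reports each top-level section that lists `amd.com/gpu`. It assumes the usual `describe` layout (unindented section headers such as `Capacity:` with indented `key: value` lines beneath); `gpu_sections` is a hypothetical name, not an `oc` feature.

```shell
# Hypothetical helper: prints "<section> <count>" for every section of
# `oc describe node` output (read from stdin) that lists amd.com/gpu.
gpu_sections() {
  awk '
    /^[^ \t]/ { section = $1; sub(/:$/, "", section) }  # unindented line starts a section
    /^[ \t]+amd\.com\/gpu:/ { print section, $2 }
  '
}
```

Piping the command into it, as in `oc describe node <node_name> | gpu_sections`, should print one line for `Capacity` and one for `Allocatable` when the resource is advertised, and nothing otherwise.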

----
lspci | grep -E "MI210|MI250|MI300"
----
.. Verify that the output includes one of the AMD GPU models. Example:
Contributor

Suggestion: Use "For example:" instead of "Example:".


.Verification
* The `oc describe node <node_name>` command lists `amd.com/gpu` under **Capacity** and **Allocatable**.
* The `lspci` command output identifies an AMD GPU as a PCI device matching one of the specified models (e.g., MI210, MI250, MI300).
Contributor

Per the style guide, do not use "e.g."; replace it with:
(for example, MI210, MI250, MI300)
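
If the `lspci` verification ever needs a clear pass or fail result, one possible sketch is below. It reuses the model list from the procedure as written; `check_lspci_models` is a hypothetical name, and the messages are illustrative only.

```shell
# Hypothetical helper: reads `lspci` output on stdin and reports whether
# any line mentions one of the models named in the procedure.
check_lspci_models() {
  if grep -Eq 'MI210|MI250|MI300'; then
    echo "AMD GPU detected"
  else
    echo "no matching AMD GPU found" >&2
    return 1
  fi
}
```

It would be used as `lspci | check_lspci_models` on the node, with the exit status available to any calling script.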

* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
*** Install version 1.1.1 of the AMD GPU Operator.
Contributor

Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.

* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
*** Install version 1.1.1 of the AMD GPU Operator.
Member

hardcoded operator version?
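
Since the hardcoded version shows up in more than one module, a quick way to find every occurrence before switching to an attribute might look like this. It is only a sketch: `hardcoded_versions` is a hypothetical helper, and the `x.y.z` pattern will also match unrelated version strings.

```shell
# Hypothetical helper: prints "line_number:line" for each line on stdin
# that contains an x.y.z style version string.
hardcoded_versions() {
  grep -nE '[0-9]+\.[0-9]+\.[0-9]+'
}
```

Running it over each changed file (for example, `hardcoded_versions < <file>.adoc`) gives the list of lines to review.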

@chtyler chtyler marked this pull request as draft January 20, 2025 16:35
@chtyler chtyler changed the title DRAFT - ENG-4292 - Initial changes for ODH AMD GPU docs [DRAFT] - ENG-4292 - Initial changes for ODH AMD GPU docs Jan 20, 2025
3 participants