[DRAFT] - ENG-4292 - Initial changes for ODH AMD GPU docs #561
…everal modules and correct some formatting issues and content errors
Minor comments, looks great overall!
ifndef::upstream[]
| Ray ROCm
endif::[]
ifdef::upstream[]
| Ray ROCm
| If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack.
Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.
Is this missing an `endif::[]` here?
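A minimal sketch of what the balanced conditional could look like, assuming the `ifdef::upstream[]` block is meant to close right after the description text. The hunk quoted above is truncated, so the closing directive may already exist further down in the file; its placement here is an assumption.

```asciidoc
ifndef::upstream[]
| Ray ROCm
endif::[]
ifdef::upstream[]
| Ray ROCm
| If you are working with compute-intensive models and you want to accelerate the training job with AMD GPU support, you can use the Ray ROCm base image to gain access to the AMD ROCm software stack.
Using this software stack, you can accelerate your work by using libraries and tools that are optimized for AMD GPUs.
// Assumed placement: closes the ifdef::upstream[] block opened above.
endif::[]
```

Each `ifdef` or `ifndef` directive written in the block form needs its own `endif::[]`, so a quick count of opening and closing directives in the module is usually enough to confirm the fix.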
You can use AMD GPUs with {productname-short} to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with {productname-short} involves the following components and benefits:
Suggestion (from comment on line 16):
Integrating AMD GPUs with {productname-short} involves the following components:
Fixed.
* **AMD GPU Operator**:
The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.

* **Why Use AMD GPUs with {productname-short}?**
This third bullet doesn't really flow with the first two. Suggestion:
Integrating AMD GPUs with {productname-short} provides the following benefits:
I'd delete this whole section, the intro paragraph ("AMD GPUs provide high-performance compute capabilities, [...]") is imo sufficient.
* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
endif::[]
ifdef::cloud-service[]
* You have logged in to OpenShift.
* You have the `cluster-admin` role in OpenShift.
* You have installed your AMD GPU and confirmed that it is detected in your environment.
* Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
Could use `{openshift-platform}` instead of OpenShift
Changed.
endif::[]

.Procedure
. Install version 1.1.1 of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].
Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.
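One way to apply the attribute suggestion, as a sketch: define the version once and reference it in the step. The attribute name `amd-gpu-operator-version` is hypothetical and does not come from this PR; in practice it would more likely be defined in a shared attributes file than next to the step.

```asciidoc
// Hypothetical attribute name; define it once, for example in a shared attributes file.
:amd-gpu-operator-version: 1.1.1

.Procedure
. Install version {amd-gpu-operator-version} of the AMD GPU Operator, as described in link:https://dcgpu.docs.amd.com/projects/gpu-operator/en/main/installation/openshift-olm.html[Install AMD GPU Operator on OpenShift].
```

With this approach, a later version bump touches only the attribute definition, and every module that references it picks up the new value.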
** To enable NVIDIA GPUs on OpenShift, you must install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html[NVIDIA GPU Operator].
* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
** To enable AMD GPUs on {openshift-platform}, you must do the following tasks:
* You have a running OpenShift cluster with a node equipped with an AMD GPU.
* You have access to the OpenShift CLI (`oc`) and terminal access to the node.

.Procedure
. Use the OpenShift CLI to verify if GPU resources are allocatable:
Could use `{openshift-platform}` instead of OpenShift
----
oc describe node <node_name>
----
.. In the output, locate the **Capacity** and **Allocatable** sections and confirm that `amd.com/gpu` is listed. Example output:
Suggestion: `For example:` instead of `Example output:` for consistency.
----
lspci | grep -E "MI210|MI250|MI300"
----
.. Verify that the output includes one of the AMD GPU models. Example:
Suggestion: Use `For example:` instead of `Example:`
.Verification
* The `oc describe node <node_name>` command lists `amd.com/gpu` under **Capacity** and **Allocatable**.
* The `lspci` command output identifies an AMD GPU as a PCI device matching one of the specified models (e.g., MI210, MI250, MI300).
Per the Style Guide, do not use `e.g.`; replace with: `(for example, MI210, MI250, MI300)`
* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
*** Install version 1.1.1 of the AMD GPU Operator.
Is it important to specify the version here? If not, it might be easier to leave it out to avoid having to update it later. Or create an attribute.
* AMD graphics processing units (GPUs)
** Use the AMD GPU Operator to enable AMD GPUs for workloads such as AI/ML training and inference.
** To enable AMD GPUs on OpenShift, you must:
*** Install version 1.1.1 of the AMD GPU Operator.
hardcoded operator version?
Description
These changes document the introduction of AMD GPUs as GA at 2.16.1 for ODH. AMD has provided all of the in-depth enablement instructions for deploying the AMD GPU Operator on RHOAI, so we link out to the AMD instructions wherever possible. In some areas, we needed to add our own content.
How Has This Been Tested?
Merge criteria: