[DRAFT] ENG-16985 - documented gaudi operator upgrade steps #623
base: main
Conversation
A few minor comments.
endif::[]

.Procedure
. Install version 1.18 of the Intel Gaudi AI Accelerator Operator, as described in link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/index.html[GaudiAI Operator OpenShift installation].
. Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/OpenShift_Installation/index.html[Intel Gaudi AI Operator OpenShift installation].
. If you are upgrading to a new version of the Intel Gaudi Accelerator Operator, you must increase the per-pod PID limit to a value over 20000, Red Hat recommends using a PID limit of 32768. This avoids `Resource temporarily unavailable` errors occurring due to PID exhaustion.
Suggest using semi-colon and {org-name}:
to a value over 20000; {org-name} recommends using a PID limit of 32768.
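For context, the `custom-kubelet-pidslimit.yaml` file applied in the next hunk is not shown in this diff. A minimal sketch of a `KubeletConfig` resource that raises the per-pod PID limit to 32768 might look like the following; the metadata name and the machine config pool selector are assumptions, not values taken from the PR:

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit  # hypothetical name, chosen to match the referenced file name
spec:
  kubeletConfig:
    podPidsLimit: 32768           # per-pod PID limit recommended in the step above
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""  # assumed: apply to the worker pool
----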
oc apply -f custom-kubelet-pidslimit.yaml
----
+
This operation causes the node to reboot. For more information on node rebooting, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/nodes/working-with-nodes#nodes-nodes-rebooting[Understanding node rebooting].
Minor, but probably don't need "on node rebooting" here:
For more information, see link:
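Because the `KubeletConfig` change triggers a rolling reboot of the affected nodes, the procedure could also point readers at how to watch the rollout. A hedged sketch, assuming the standard `oc` client, the worker machine config pool, and the hypothetical resource name used above:

[source,terminal]
----
$ oc get machineconfigpool worker    # UPDATING=True while nodes are drained and rebooted one at a time
$ oc get nodes                       # affected nodes pass through NotReady,SchedulingDisabled during the reboot
$ oc get kubeletconfig custom-kubelet-pidslimit -o yaml   # confirm the KubeletConfig was accepted
----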
@@ -33,5 +33,5 @@ The presence of Intel Gaudi AI accelerators in your deployment, as indicated by
.Additional resources
* link:https://linux.die.net/man/8/lspci[lspci(8) - Linux man page]
* link:https://aws.amazon.com/ec2/instance-types/dl1/[Amazon EC2 DL1 Instances]
* link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/index.html[Deploying the Intel Gaudi AI Accelerator Operator]
* link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/OpenShift_Installation/index.html[Deploying the Intel Gaudi AI Accelerator Operator]
Should this be the same link name as above?
[Intel Gaudi AI Operator OpenShift installation]
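The hunk's context line mentions detecting the presence of Intel Gaudi AI accelerators, and the additional resources keep the lspci(8) man page. As a rough illustration only (the exact command is not part of this PR, and it assumes `pciutils` is available on the node or in a debug pod), the check might look like:

[source,terminal]
----
# Run on (or via a debug pod on) a node that should host the accelerators.
$ lspci | grep -i habana    # Gaudi devices are listed under the Habana Labs vendor
----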
Description
This PR describes the steps to enable the Intel Gaudi AI Operator in ODH.
How Has This Been Tested?
Merge criteria: