[DRAFT] ENG-16985 - documented gaudi operator upgrade steps #623

Draft
wants to merge 2 commits into
base: main
from

chtyler
Contributor

@chtyler chtyler commented Jan 30, 2025

Description

This PR describes the steps to enable the Intel Gaudi AI Operator in ODH.

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@chtyler chtyler marked this pull request as draft January 30, 2025 17:18
@chtyler chtyler changed the title ENG-16985 - documented gaudi operator upgrade steps [DRAFT] ENG-16985 - documented gaudi operator upgrade steps Jan 30, 2025
Contributor

@aduquett aduquett left a comment

A few minor comments.

endif::[]

.Procedure
. Install version 1.18 of the Intel Gaudi AI Accelerator Operator, as described in link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/index.html[GaudiAI Operator OpenShift installation].
. Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/OpenShift_Installation/index.html[Intel Gaudi AI Operator OpenShift installation].
. If you are upgrading to a new version of the Intel Gaudi Accelerator Operator, you must increase the per-pod PID limit to a value over 20000, Red Hat recommends using a PID limit of 32768. This avoids `Resource temporarily unavailable` errors occurring due to PID exhaustion.
Contributor

Suggest using a semicolon and {org-name}:
to a value over 20000; {org-name} recommends using a PID limit of 32768.

oc apply -f custom-kubelet-pidslimit.yaml
----
+
This operation causes the node to reboot. For more information on node rebooting, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/nodes/working-with-nodes#nodes-nodes-rebooting[Understanding node rebooting].
Contributor

Minor, but probably don't need "on node rebooting" here:
For more information, see link:
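
For context on the step under review: the contents of `custom-kubelet-pidslimit.yaml` are not shown in this excerpt. A minimal sketch of what such a file might contain, assuming it is a `KubeletConfig` custom resource that targets the default worker machine config pool and sets the 32768 limit recommended in the procedure. The resource name and the pool selector label are assumptions, not taken from the PR:

```yaml
# Hypothetical custom-kubelet-pidslimit.yaml; the actual file in the PR may differ.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit   # assumed name
spec:
  machineConfigPoolSelector:
    matchLabels:
      # Assumes the default worker pool label; adjust for custom pools.
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    # Per-pod PID limit; the procedure calls for a value over 20000.
    podPidsLimit: 32768
```

Applying a resource like this rolls the change out through the Machine Config Operator, which is consistent with the node reboot noted in the step.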

@@ -33,5 +33,5 @@ The presence of Intel Gaudi AI accelerators in your deployment, as indicated by
.Additional resources
* link:https://linux.die.net/man/8/lspci[lspci(8) - Linux man page]
* link:https://aws.amazon.com/ec2/instance-types/dl1/[Amazon EC2 DL1 Instances]
* link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Intel_Gaudi_Base_Operator/index.html[Deploying the Intel Gaudi AI Accelerator Operator]
* link:https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/OpenShift_Installation/index.html[Deploying the Intel Gaudi AI Accelerator Operator]
Contributor

Should this be the same link name as above?
[Intel Gaudi AI Operator OpenShift installation]

2 participants