Failure to create RayCluster with an imagePullSecret #649
Comments
/cc @varshaprasad96
My workaround was to remove the imagePullSecret from my head node spec and then manually link the secret with the service account.
Following that, the head node pod gets deleted, and the next time it is created it lists the imagePullSecret I manually added to the service account.
@dgrove-oss IIRC, the reason for introducing that feature was a race condition in which the codeflare-operator reconciled the head pod before OpenShift could add the secret to the service account, resulting in an image pull failure that required a manual restart. That is why the logic was introduced to check whether the SA has its secrets ready and restart the head pod. @Ygnas @astefanutti could you please confirm whether my understanding is correct? Does the continuous restart happen even after the secret is available?
I think the problem is that pods which define their own imagePullSecrets don't inherit those that the ServiceAccount specifies: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller
As a result, the logic in PR #601 will never see all the imagePullSecrets of the SA in the pod, and it will keep deleting the head node forever.
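To make the failure mode concrete, here is a minimal sketch (not the actual operator code; all names below are illustrative, not taken from the codeflare-operator source) of why a check that waits for the SA's imagePullSecrets to appear on the pod can never succeed once the pod defines its own:

```python
def effective_image_pull_secrets(pod_secrets, sa_secrets):
    """Model the ServiceAccount admission controller: the SA's
    imagePullSecrets are injected only when the pod defines none of its own."""
    return list(pod_secrets) if pod_secrets else list(sa_secrets)


def head_pod_secrets_ready(pod_secrets, sa_secrets):
    """Model the readiness check: are all of the SA's imagePullSecrets
    visible on the pod?"""
    visible = set(effective_image_pull_secrets(pod_secrets, sa_secrets))
    return set(sa_secrets) <= visible


# Pod with no imagePullSecrets: the admission controller injects the SA's,
# so the check passes once OpenShift has populated the SA.
assert head_pod_secrets_ready([], ["default-dockercfg-abc"])

# Pod with a user-provided imagePullSecret: the SA's secrets are never
# merged in, so the check fails on every reconcile -> endless restarts.
assert not head_pod_secrets_ready(["my-registry-secret"], ["default-dockercfg-abc"])
```

The two assertions mirror the two behaviors reported in this issue: the cluster comes up fine without a user-provided imagePullSecret, and loops forever with one.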
This issue is fixed in the CodeFlare operator v1.14.0 release.
I think there might still be a problem here in the scenario where a RayCluster defines an imagePullSecret for the head node and codeflare injects the oauth-proxy sidecar container into the head node. The pod's imagePullSecrets won't inherit those defined in the oauth-proxy ServiceAccount, so the oauth-proxy sidecar container won't be able to pull its image.
The default oauth-proxy image is taken from
So
Oh, I didn't realize that! Thanks for the clarification.
Describe the Bug
Doing an oc apply of the attached yaml for a RayCluster
broken-raycluster.yaml.txt
where the head pod has a user-provided imagePullSecret results in an endless reconciliation loop, with the head pod being rapidly deleted and recreated. This appears to be because the logic added in #601 is not acting as intended.
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Running RHOAI 2.16 on OpenShift 4.14.
Steps to Reproduce the Bug
Run oc apply on the attached broken-raycluster.yaml.txt and watch the head pod enter the delete/recreate loop.
What Have You Already Tried to Debug the Issue?
Have recreated the problem multiple times. Happens reliably.
Have verified that removing the imagePullSecret from the head pod specification results in the RayCluster being created successfully, as expected.
Expected Behavior
Users should be able to provide imagePullSecrets for the head node of a RayCluster to enable the use of private registries.
Screenshots, Console Output, Logs, etc.
I've attached the relevant log snippets from the codeflare-operator (codeflare-log.txt) and the kuberay-operator (kuberay.txt).