
Failure to create RayCluster with an imagePullSecret #649

Closed
dgrove-oss opened this issue Jan 14, 2025 · 8 comments
Comments

@dgrove-oss
Collaborator

dgrove-oss commented Jan 14, 2025

Describe the Bug

Doing an oc apply of the attached yaml for a RayCluster
broken-raycluster.yaml.txt
where the head pod has a user-provided imagePullSecret results in an endless reconciliation loop in which the head pod is rapidly deleted and recreated. This appears to be because the logic added in #601 is not acting as intended.
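The relevant part of the spec looks roughly like the following. This is a minimal sketch, not the attached file; the cluster name, secret name, and image are hypothetical placeholders:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: my-ray-cluster                 # hypothetical name
spec:
  headGroupSpec:
    template:
      spec:
        imagePullSecrets:
        - name: my-image-pull-secret   # hypothetical user-provided secret
        containers:
        - name: ray-head
          image: private-registry.example.com/ray:2.9.0  # hypothetical private image
```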

Codeflare Stack Component Versions

Please specify the component versions in which you have encountered this bug.

Running RHOAI 2.16 on OpenShift 4.14.

Steps to Reproduce the Bug

  1. oc apply the yaml for the RayCluster
  2. observe an endless cascade of head pods being rapidly deleted

What Have You Already Tried to Debug the Issue?

Have recreated the problem multiple times. Happens reliably.

Have verified that removing the imagePullSecret from the head pod specification results in the RayCluster being created successfully, as expected.

Expected Behavior

Users should be able to provide imagePullSecrets for the head node of a RayCluster to enable the use of private registries.

Screenshots, Console Output, Logs, etc.

I've attached the relevant log snippets from the codeflare-operator (codeflare-log.txt) and the kuberay-operator (kuberay.txt).

@dgrove-oss
Collaborator Author

/cc @varshaprasad96

@VassilisVassiliadis

My workaround was to remove the imagePullSecret from my head node spec and then manually link the secret to the ${ray cluster name}-oauth-proxy service account, e.g.

oc secrets link my-ray-cluster-oauth-proxy my-image-pull-secret --for=pull

Following that, the head node pod gets deleted and the next time it gets created it lists the imagePullSecret I manually added to the service account.
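After the link, the service account should list the secret under imagePullSecrets, roughly like this (a sketch with hypothetical names matching the command above):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-ray-cluster-oauth-proxy
imagePullSecrets:
- name: my-image-pull-secret   # added by `oc secrets link ... --for=pull`
```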

@varshaprasad96
Contributor

varshaprasad96 commented Jan 15, 2025

@dgrove-oss IIRC, the reason for introducing that feature was a race condition: the codeflare-operator reconciled the head pod before OpenShift could add the secret to the service account, which caused an image pull failure and required a manual restart. That is why the logic was introduced to check whether the SA has its secrets ready and restart the head pod. @Ygnas @astefanutti could you please correct me if my understanding is wrong?

Does the continuous restart happen even after the secret is available?

@VassilisVassiliadis

I think the problem is that pods which define their own imagePullSecrets don't inherit those that the ServiceAccount specifies: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#serviceaccount-admission-controller

  1. If the spec of the incoming Pod doesn't already contain any imagePullSecrets, then the admission controller adds imagePullSecrets, copying them from the ServiceAccount.

As a result, the logic in PR #601 will never see all of the SA's imagePullSecrets on the pod, and it will keep deleting the head node forever.
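To illustrate the admission-controller behavior quoted above (all names are hypothetical): given a ServiceAccount that carries an operator-managed secret, a pod that already sets its own imagePullSecrets keeps only its own list, because the controller copies secrets from the SA only when the pod spec has none:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-ray-cluster-oauth-proxy
imagePullSecrets:
- name: sa-managed-secret          # e.g. linked by OpenShift for the internal registry
---
apiVersion: v1
kind: Pod
metadata:
  name: my-ray-cluster-head
spec:
  serviceAccountName: my-ray-cluster-oauth-proxy
  imagePullSecrets:
  - name: my-image-pull-secret     # user-provided; because this list is non-empty,
                                   # sa-managed-secret is NOT merged in by the
                                   # ServiceAccount admission controller
  containers:
  - name: ray-head
    image: private-registry.example.com/ray:2.9.0
```

So a check that expects the pod to carry every secret listed on the SA can never be satisfied once the user supplies their own imagePullSecrets.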

@sutaakar
Contributor

This issue is fixed in the CodeFlare operator v1.14.0 release.

@VassilisVassiliadis

I think there might still be a problem here for the scenario where a RayCluster defines an imagePullSecret for the head node and codeflare injects the oauth-proxy sidecar container into the head pod.

The pod's imagePullSecrets won't inherit those defined on the oauth-proxy ServiceAccount, so the oauth-proxy sidecar container won't be able to pull its image.

@sutaakar
Contributor

The default oauth proxy image is taken from registry.redhat.io.
The pull secret for registry.redhat.io is specified in the global cluster pull secret (the pull-secret Secret in the openshift-config namespace).

The raycluster-oauth-proxy pull secret contains authentication for the internal OpenShift registry only; it is needed only for images taken from the internal registry (i.e. ImageStream images).

So raycluster-oauth-proxy is not relevant to the oauth proxy image (unless the user overrides the default oauth proxy image with an image from the internal registry, which IMHO is not a supported use case).

raycluster-oauth-proxy is relevant just for pulling the Ray image from the internal registry (the naming is confusing here).

@VassilisVassiliadis

Oh, I didn't realize that! Thanks for the clarification.
