Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rayjob event can't set redis password in both GCSFaultTolerance and rayStartParam #3093

Merged
merged 10 commits into from
Mar 2, 2025

Conversation

fscnick
Copy link
Contributor

@fscnick fscnick commented Feb 22, 2025

Why are these changes needed?

If an user sets redis password in both GCSFaultTolerance and rayStartParam, it will prompt CR event in RayCluster.
However, it's not show in RayJob. Thus, the user need to find the CR event in RayCluster to know the cause.

The PR shows CR event on RayJob if it gets this kind of error.

Here is result of screenshot.
rayjob fail

Related issue number

#2986

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -164,3 +137,47 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error {
}
return nil
}

func ValidateGCSFaultTolerance(spec *rayv1.RayClusterSpec, annotations map[string]string) error {
Copy link
Contributor Author

@fscnick fscnick Feb 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put GCSFaultTolerance-related checks in a function to make it reusable in RayCluster and RayJob. The code in the function is mainly from

if instance.Annotations[RayFTEnabledAnnotationKey] != "" && instance.Spec.GcsFaultToleranceOptions != nil {
return fmt.Errorf("%s annotation and GcsFaultToleranceOptions are both set. "+
"Please use only GcsFaultToleranceOptions to configure GCS fault tolerance", RayFTEnabledAnnotationKey)
}
if !IsGCSFaultToleranceEnabled(*instance) {
if EnvVarExists(RAY_REDIS_ADDRESS, instance.Spec.HeadGroupSpec.Template.Spec.Containers[RayContainerIndex].Env) {
return fmt.Errorf("%s is set which implicitly enables GCS fault tolerance, "+
"but GcsFaultToleranceOptions is not set. Please set GcsFaultToleranceOptions "+
"to enable GCS fault tolerance", RAY_REDIS_ADDRESS)
}
}
headContainer := instance.Spec.HeadGroupSpec.Template.Spec.Containers[RayContainerIndex]
if instance.Spec.GcsFaultToleranceOptions != nil {
if redisPassword := instance.Spec.HeadGroupSpec.RayStartParams["redis-password"]; redisPassword != "" {
return fmt.Errorf("cannot set `redis-password` in rayStartParams when " +
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisPassword instead")
}
if EnvVarExists(REDIS_PASSWORD, headContainer.Env) {
return fmt.Errorf("cannot set `REDIS_PASSWORD` env var in head Pod when " +
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisPassword instead")
}
if EnvVarExists(RAY_REDIS_ADDRESS, headContainer.Env) {
return fmt.Errorf("cannot set `RAY_REDIS_ADDRESS` env var in head Pod when " +
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisAddress instead")
}
if instance.Annotations[RayExternalStorageNSAnnotationKey] != "" {
return fmt.Errorf("cannot set `ray.io/external-storage-namespace` annotation when " +
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.ExternalStorageNamespace instead")
}
}
if instance.Spec.HeadGroupSpec.RayStartParams["redis-username"] != "" || EnvVarExists(REDIS_USERNAME, headContainer.Env) {
return fmt.Errorf("cannot set redis username in rayStartParams or environment variables" +
" - use GcsFaultToleranceOptions.RedisUsername instead")
}

{
Env: tt.envVars,
},
err := ValidateGCSFaultTolerance(&rayv1.RayClusterSpec{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to ValidateGCSFaultTolerance because this test is testing GcsFaultToleranceOptions.

@fscnick fscnick marked this pull request as ready for review February 23, 2025 12:21
@fscnick
Copy link
Contributor Author

fscnick commented Feb 23, 2025

@kevin85421 PTAL

@@ -112,6 +76,13 @@ func ValidateRayJobSpec(rayJob *rayv1.RayJob) error {
if rayJob.Spec.RayClusterSpec == nil && !isClusterSelectorMode {
return fmt.Errorf("one of RayClusterSpec or ClusterSelector must be set")
}

if rayJob.Spec.RayClusterSpec != nil {
if err := ValidateGCSFaultTolerance(rayJob.Spec.RayClusterSpec, rayJob.Annotations); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to validate the entire RayCluster spec in the RayJob controller instead of validating different parts one by one. That is,

if rayJob.Spec.RayClusterSpec != nil {
  if err := ValidateRayClusterSpec(rayJob.Spec.RayClusterSpec, rayJob.Annotations); err != nil {
    ...
  }
}

In addition, this assumes that RayJob will propagate annotations to RayCluster. Could you please check whether we have tests to ensure this behavior? If not, you can add tests in this PR or open another PR to add tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is no such test to ensure that. I'll create another PR to add tests.

@@ -164,3 +135,47 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error {
}
return nil
}

func ValidateGCSFaultTolerance(spec *rayv1.RayClusterSpec, annotations map[string]string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do the refactoring, but it's better to do it in a follow-up PR. Let's focus on calling ValidateRayClusterSpec in RayJob controller in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has been removed and the code has put back inside ValidateRayClusterSpec.

@kevin85421 kevin85421 self-assigned this Feb 23, 2025
@fscnick fscnick requested a review from kevin85421 February 24, 2025 14:48
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split IsAutoscalingEnabled into a separate function to make it easier to review.

@fscnick
Copy link
Contributor Author

fscnick commented Feb 28, 2025

@kevin85421 PTAL

@@ -201,30 +229,44 @@ func TestValidateRayClusterSpecGcsFaultToleranceOptions(t *testing.T) {
expectError: true,
errorMessage: errorMessageExternalStorageNamespaceConflict,
},
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two tests have already been tested in TestValidateRayClusterSpecRedisUsername.

@@ -156,6 +163,27 @@ func TestValidateRayClusterSpecGcsFaultToleranceOptions(t *testing.T) {
expectError: true,
errorMessage: errorMessageRedisAddressSet,
},
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two tests have already been tested in TestValidateRayClusterSpecRedisPassword.

@@ -576,6 +618,14 @@ func TestValidateRayJobStatus(t *testing.T) {
}

func TestValidateRayJobSpec(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect a test with a RayJob with an invalid RayClusterSpec

@kevin85421
Copy link
Member

Would you mind rebasing with the master branch? There is a CI fix in the master branch.

@fscnick fscnick force-pushed the rayjob-event-cant-set-pwd branch from 5b61cb9 to 6d4d001 Compare March 2, 2025 08:42
Comment on lines +709 to +716
err = ValidateRayJobSpec(&rayv1.RayJob{
Spec: rayv1.RayJobSpec{
RayClusterSpec: &rayv1.RayClusterSpec{
HeadGroupSpec: rayv1.HeadGroupSpec{},
},
},
})
require.ErrorContains(t, err, "headGroupSpec should have at least one container")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the invalid RayClusterSpec in RayJob according to suggestion
#3093 (comment)

@kevin85421
Copy link
Member

Nice! Would you mind opening another PR to validate the cluster spec in RayService?

@kevin85421 kevin85421 merged commit 6c4a77d into ray-project:master Mar 2, 2025
21 checks passed
tinaxfwu pushed a commit to tinaxfwu/kuberay that referenced this pull request Mar 2, 2025
@fscnick
Copy link
Contributor Author

fscnick commented Mar 3, 2025

Nice! Would you mind opening another PR to validate the cluster spec in RayService?

Sure, I'll do that.

@fscnick fscnick deleted the rayjob-event-cant-set-pwd branch March 3, 2025 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants