-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rayjob event can't set redis password in both GCSFaultTolerance and rayStartParam #3093
Rayjob event can't set redis password in both GCSFaultTolerance and rayStartParam #3093
Conversation
@@ -164,3 +137,47 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error { | |||
} | |||
return nil | |||
} | |||
|
|||
func ValidateGCSFaultTolerance(spec *rayv1.RayClusterSpec, annotations map[string]string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put GCSFaultTolerance-related checks in a function to make it reusable in RayCluster and RayJob. The code in the function is mainly from
kuberay/ray-operator/controllers/ray/utils/validation.go
Lines 34 to 72 in 5ccf361
if instance.Annotations[RayFTEnabledAnnotationKey] != "" && instance.Spec.GcsFaultToleranceOptions != nil { | |
return fmt.Errorf("%s annotation and GcsFaultToleranceOptions are both set. "+ | |
"Please use only GcsFaultToleranceOptions to configure GCS fault tolerance", RayFTEnabledAnnotationKey) | |
} | |
if !IsGCSFaultToleranceEnabled(*instance) { | |
if EnvVarExists(RAY_REDIS_ADDRESS, instance.Spec.HeadGroupSpec.Template.Spec.Containers[RayContainerIndex].Env) { | |
return fmt.Errorf("%s is set which implicitly enables GCS fault tolerance, "+ | |
"but GcsFaultToleranceOptions is not set. Please set GcsFaultToleranceOptions "+ | |
"to enable GCS fault tolerance", RAY_REDIS_ADDRESS) | |
} | |
} | |
headContainer := instance.Spec.HeadGroupSpec.Template.Spec.Containers[RayContainerIndex] | |
if instance.Spec.GcsFaultToleranceOptions != nil { | |
if redisPassword := instance.Spec.HeadGroupSpec.RayStartParams["redis-password"]; redisPassword != "" { | |
return fmt.Errorf("cannot set `redis-password` in rayStartParams when " + | |
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisPassword instead") | |
} | |
if EnvVarExists(REDIS_PASSWORD, headContainer.Env) { | |
return fmt.Errorf("cannot set `REDIS_PASSWORD` env var in head Pod when " + | |
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisPassword instead") | |
} | |
if EnvVarExists(RAY_REDIS_ADDRESS, headContainer.Env) { | |
return fmt.Errorf("cannot set `RAY_REDIS_ADDRESS` env var in head Pod when " + | |
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.RedisAddress instead") | |
} | |
if instance.Annotations[RayExternalStorageNSAnnotationKey] != "" { | |
return fmt.Errorf("cannot set `ray.io/external-storage-namespace` annotation when " + | |
"GcsFaultToleranceOptions is enabled - use GcsFaultToleranceOptions.ExternalStorageNamespace instead") | |
} | |
} | |
if instance.Spec.HeadGroupSpec.RayStartParams["redis-username"] != "" || EnvVarExists(REDIS_USERNAME, headContainer.Env) { | |
return fmt.Errorf("cannot set redis username in rayStartParams or environment variables" + | |
" - use GcsFaultToleranceOptions.RedisUsername instead") | |
} |
{ | ||
Env: tt.envVars, | ||
}, | ||
err := ValidateGCSFaultTolerance(&rayv1.RayClusterSpec{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to ValidateGCSFaultTolerance
because this test is testing GcsFaultToleranceOptions.
@kevin85421 PTAL |
@@ -112,6 +76,13 @@ func ValidateRayJobSpec(rayJob *rayv1.RayJob) error { | |||
if rayJob.Spec.RayClusterSpec == nil && !isClusterSelectorMode { | |||
return fmt.Errorf("one of RayClusterSpec or ClusterSelector must be set") | |||
} | |||
|
|||
if rayJob.Spec.RayClusterSpec != nil { | |||
if err := ValidateGCSFaultTolerance(rayJob.Spec.RayClusterSpec, rayJob.Annotations); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to validate the entire RayCluster spec in the RayJob controller instead of validating different parts one by one. That is,
if rayJob.Spec.RayClusterSpec != nil {
if err := ValidateRayClusterSpec(rayJob.Spec.RayClusterSpec, rayJob.Annotations); err != nil {
...
}
}
In addition, this assumes that RayJob will propagate annotations to RayCluster. Could you please check whether we have tests to ensure this behavior? If not, you can add tests in this PR or open another PR to add tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is no such test to ensure that. I'll create another PR to add tests.
@@ -164,3 +135,47 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error { | |||
} | |||
return nil | |||
} | |||
|
|||
func ValidateGCSFaultTolerance(spec *rayv1.RayClusterSpec, annotations map[string]string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do the refactoring, but it's better to do it in a follow-up PR. Let's focus on calling ValidateRayClusterSpec
in RayJob controller in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function has been removed and the code has put back inside ValidateRayClusterSpec
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split IsAutoscalingEnabled
into a separate function to make it easier to review.
864bb09
to
7615508
Compare
7615508
to
7dc75c2
Compare
@kevin85421 PTAL |
@@ -201,30 +229,44 @@ func TestValidateRayClusterSpecGcsFaultToleranceOptions(t *testing.T) { | |||
expectError: true, | |||
errorMessage: errorMessageExternalStorageNamespaceConflict, | |||
}, | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two tests have already been tested in TestValidateRayClusterSpecRedisUsername
.
@@ -156,6 +163,27 @@ func TestValidateRayClusterSpecGcsFaultToleranceOptions(t *testing.T) { | |||
expectError: true, | |||
errorMessage: errorMessageRedisAddressSet, | |||
}, | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two tests have already been tested in TestValidateRayClusterSpecRedisPassword
.
@@ -576,6 +618,14 @@ func TestValidateRayJobStatus(t *testing.T) { | |||
} | |||
|
|||
func TestValidateRayJobSpec(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect a test with a RayJob with an invalid RayClusterSpec
Would you mind rebasing with the master branch? There is a CI fix in the master branch. |
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
…tTolerance Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
… RayJob Signed-off-by: fscnick <[email protected]>
5b61cb9
to
6d4d001
Compare
err = ValidateRayJobSpec(&rayv1.RayJob{ | ||
Spec: rayv1.RayJobSpec{ | ||
RayClusterSpec: &rayv1.RayClusterSpec{ | ||
HeadGroupSpec: rayv1.HeadGroupSpec{}, | ||
}, | ||
}, | ||
}) | ||
require.ErrorContains(t, err, "headGroupSpec should have at least one container") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the invalid RayClusterSpec in RayJob according to suggestion
#3093 (comment)
Nice! Would you mind opening another PR to validate the cluster spec in RayService? |
Sure, I'll do that. |
Why are these changes needed?
If an user sets redis password in both GCSFaultTolerance and rayStartParam, it will prompt CR event in RayCluster.
However, it's not show in RayJob. Thus, the user need to find the CR event in RayCluster to know the cause.
The PR shows CR event on RayJob if it gets this kind of error.
Here is result of screenshot.

Related issue number
#2986
Checks