Make it easy to describe the preempting workload #4038
Comments
Have you looked at kueuectl? Maybe the UX would be better using the CLI?
I don't remember if kueuectl would allow it; probably not. In any case, as an external binary the only way to find the preempting workload by UID would be to list all workloads first, which is indeed not scalable. I like the idea of adding the UID as a label. cc @tenzen-y WDYT?
Thank you for creating this issue. IMO, instead of a label with the UID, I would propose the CRD field selector: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4358-custom-resource-field-selectors Have you checked whether we can implement a custom field selector for Workload with that feature?
I have not checked this myself. It might be worth trying as an alternative, but I'm not sure this works with metadata.uid.
I see. In that case, can we try confirming whether the custom CRD field can be selected? This can be confirmed easily, IMO.
SGTM, we can start by checking whether https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4358-custom-resource-field-selectors#proposal can be used to filter the workloads first.
Looks like controller-gen has support for a +kubebuilder:selectablefield marker. However, the Workload UID comes from k8s.io/apimachinery/pkg/apis/meta/v1.ObjectMeta (https://github.com/kubernetes-sigs/kueue/blob/main/apis/kueue/v1beta1/workload_types.go#L616), so I'm not sure how we could add that marker.
Actually, I think in this use-case we need an index. Also, I synced with @deads2k, and the KEP only supports scalar fields, so it would not work for non-scalar fields.
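For reference, the in-process flavor of such an index (a controller-runtime cache index) would look roughly like the sketch below. It only helps controllers and webhooks query the cache, not kubectl users, and the index key name here is purely illustrative.

```go
package main

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// setupUIDIndex registers a cache index so in-process clients can list
// Workloads by UID, e.g.:
//
//	c.List(ctx, &wls, client.MatchingFields{"metadata.uid": uid})
//
// It does not make the field selectable via kubectl; it is only meant to
// illustrate the "index" idea mentioned above.
func setupUIDIndex(ctx context.Context, mgr manager.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &kueue.Workload{}, "metadata.uid",
		func(obj client.Object) []string {
			return []string{string(obj.GetUID())}
		})
}
```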
Looked into the selectableFields some more. JSON paths into metadata are explicitly not allowed, per https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#selectablefield-v1-apiextensions-k8s-io.
This aligns with the KEP's validation rules section: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md#validation-rules
I'm currently slightly confused, since the KEP uses metadata fields as examples: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md#openapi-discovery
@tenzen-y yes, I tested it out in main...avrittrohwer:kueue:log. I got a validation error when applying the CRD to a 1.31 kind cluster.
I tested with // +kubebuilder:selectablefield:JSONPath=`.spec.queueName`, which worked, so I think the issue is with the metadata JSON path.
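For reference, a trimmed-down sketch (not the real workload_types.go) of where that marker sits and what it enables; the WorkloadSpec here is reduced to the single field used in the test above.

```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch only: the type-level marker below is the one confirmed to work.
// controller-gen emits it as selectableFields in the generated CRD, so
// `kubectl get workloads --field-selector spec.queueName=<name>` should work
// on a 1.31 cluster. A marker pointing at `.metadata.uid` is rejected by CRD
// validation, as noted above.
//
// +kubebuilder:object:root=true
// +kubebuilder:selectablefield:JSONPath=`.spec.queueName`
type Workload struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec WorkloadSpec `json:"spec,omitempty"`
}

// WorkloadSpec is trimmed to the single field relevant to the example.
type WorkloadSpec struct {
	QueueName string `json:"queueName,omitempty"`
}
```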
Thank you for confirming that! As far as I can check, the implementation seems to be correct... Let's add the annotation if @mimowo does not have any objections.
You mean the "UID" label? Sure, we can do it, and mark it as immutable since UIDs are immutable.
Yeah, I meant the UID label.
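A minimal sketch of the label idea, assuming a hypothetical label key (the real key name would be settled in the implementation PR):

```go
package workload

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workloadUIDLabel is a hypothetical key used only in this sketch.
const workloadUIDLabel = "kueue.x-k8s.io/workload-uid"

// ensureUIDLabel copies the immutable object UID into a label so that the
// workload can later be fetched with `kubectl get workloads -l ...`.
// It returns true if the object was modified and the caller needs to issue
// a single Update or Patch.
func ensureUIDLabel(wl *kueue.Workload) bool {
	if wl.UID == "" {
		return false // the UID is only assigned by the API server on create
	}
	if wl.Labels == nil {
		wl.Labels = map[string]string{}
	}
	if wl.Labels[workloadUIDLabel] == string(wl.UID) {
		return false
	}
	wl.Labels[workloadUIDLabel] = string(wl.UID)
	return true
}
```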
Any thoughts on how we want to handle MultiKueue remote workloads? Should the UID label just get dropped when we make a remote copy?
I see why the question: the UID from the worker cluster (present in the condition message of the preempted workload) would not be found on the management cluster anyway. I think ideally:
WDYT?
Thinking about it more: to be fair, I would just do 1. now unconditionally, and defer the MK parts (2. and 3.) as a future improvement.
It is not that much more effort to add a 'main-workload-uid' label to the remote objects as well.
Actually, in the name of keeping the integration test changes in each PR more scoped, let's defer all the 'main-workload-uid' labeling on the remote objects to a follow-up change.
The naive approach of adding an update call in the Workload Reconcile loop is negatively impacting the schedulability perf tests: #4339 (comment). It seems like increasing the Kueue qps and burst parameters could mitigate this. Taking a step back, however: is there a good reason the condition message includes only the UID rather than the preempting workload's name and namespace?
Logging the full name and namespace is OK. We actually have a PR which does it in 0.11: 7dcff89. I think we could justify cherry-picking it to 0.10. However, when creating the message for the condition we only include the UID, to avoid leaking potentially tenant-sensitive information that could be present even just in the workload name. The analogous approach is taken for preemption in core k8s (for events and conditions): https://github.com/kubernetes/kubernetes/blob/0e2a2afc4c7687e71f2dddbb7edfd03f082c6876/pkg/scheduler/framework/preemption/preemption.go#L198
Ah okay. Any ideas on an existing controller Workload update or patch call that we could piggyback on to also update the labels? I've run the scheduling perf tests a bunch, and adding another update call adds too much overhead to the workload controller loop.
Maybe I could update the defaulting workload webhook to also run on updates and do the label defaulting there? It is not possible on create because the workload object does not have a UID yet.
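A rough sketch of that webhook idea, assuming the Workload defaulting webhook is a controller-runtime CustomDefaulter and reusing the hypothetical label key from the earlier sketch:

```go
package webhooks

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

const workloadUIDLabel = "kueue.x-k8s.io/workload-uid" // hypothetical key

type WorkloadWebhook struct{}

var _ webhook.CustomDefaulter = &WorkloadWebhook{}

// Default copies the UID into a label. On CREATE the API server has not
// assigned a UID yet, so the defaulting only takes effect on UPDATE
// requests, which is why the webhook configuration would need to include
// the Update operation.
func (w *WorkloadWebhook) Default(_ context.Context, obj runtime.Object) error {
	wl, ok := obj.(*kueue.Workload)
	if !ok {
		return fmt.Errorf("expected a Workload but got %T", obj)
	}
	if wl.UID == "" {
		return nil
	}
	if wl.Labels == nil {
		wl.Labels = map[string]string{}
	}
	wl.Labels[workloadUIDLabel] = string(wl.UID)
	return nil
}
```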
How much of a hit is it? Maybe we could consider just swallowing the pill and bumping the thresholds.
This actually sounds like it could add even more overhead (but maybe it would not be visible in the perf tests; I cannot recall if they install webhooks).
@mimowo @avrittrohwer We should not record the preemptor name and namespace in events and logging since those are PII. You can find a similar discussion in #1942 (comment).
I know; I was hoping this would change with MAP, as then it all happens within the API server, but it seems it will not be visible in the CEL rules (though I haven't seen an explicit mention in the KEP).
I think there is no reason to revert; this PR does exactly that: it only adds to logging but does not modify the message used on the condition. Also, we have integration tests showing the message format.
Ah, I see. It seems to focus only on logging. In that case it should be fine; there are no leaks between tenants.
Taking a step back: another option is that workloads already contain the owner Job UID in the kueue.x-k8s.io/job-uid label. So maybe in the preemption messages we also record the Job UID. Then a user could use it for a fast fetch. This would mean no new requests and no performance hit. The only problem is with Pod Groups; however, maybe this is a corner case. Also, for a pod group we could record the UID of the "first" pod, which would already allow users to identify the preempting group.
WDYT @tenzen-y @avrittrohwer about the Job UID approach for the fast search?
I don't think that would help, because we have the same problem of "how do I get a Job by its UID", which is not supported by Kubernetes: kubernetes/kubernetes#20572
We will search workloads by the Job UID, as it is already recorded on the workload object: kueue/pkg/controller/jobframework/reconciler.go, line 1022 at 099a239. This leverages the fact that at the moment of workload creation we already have the Job UID, but not the Workload UID.
Right, but once the user has the Job UID, how do they kubectl get it by UID?
Oh, they could do kubectl get workloads -A -l kueue.x-k8s.io/job-uid=<UID>.
Adding the Job UID to the condition and event message is a much easier change. Let's do that.
That sounds reasonable.
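For completeness, a sketch of what the lookup would look like once the Job UID is in the message, relying only on the kueue.x-k8s.io/job-uid label that Kueue already sets:

```go
package main

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// workloadsForJobUID lists the Workloads owned by the Job whose UID appears
// in the preemption message. It is the programmatic equivalent of
// `kubectl get workloads -A -l kueue.x-k8s.io/job-uid=<UID>`.
func workloadsForJobUID(ctx context.Context, c client.Client, jobUID string) (*kueue.WorkloadList, error) {
	var wls kueue.WorkloadList
	if err := c.List(ctx, &wls, client.MatchingLabels{"kueue.x-k8s.io/job-uid": jobUID}); err != nil {
		return nil, err
	}
	return &wls, nil
}
```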
What would you like to be added:
Make it easy to kubectl describe the workload that evicted another workload.
Why is this needed:
Kueue Workloads have events like
kueue-admission Preempted to accommodate a workload (UID: <UID>) due to fair sharing within the cohort
As far as I can tell, kubectl --selector only supports filtering on values in the object's .metadata.labels. The Workload UID is in the .metadata.uid field, which cannot be filtered on with kubectl --selector. For clusters with few workloads, grepping the output of kubectl get workloads -A would work, but that does not scale to clusters with thousands of workloads.
Completion requirements:
Make it easy to kubectl describe the Workload that preempted another workload. Simple solution: add the Workload UID to the Workload labels.
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.