
Make it easy to describe the preempting workload #4038

Open

avrittrohwer opened this issue Jan 22, 2025 · 38 comments · May be fixed by #4524
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@avrittrohwer
Contributor

avrittrohwer commented Jan 22, 2025

What would you like to be added:

Make it easy to kubectl describe the workload that evicted another workload.

Why is this needed:

Kueue Workloads have events like kueue-admission Preempted to accommodate a workload (UID: <UID>) due to fair sharing within the cohort

As far as I can tell, the kubectl --selector flag only supports filtering on values in the object's .metadata.labels. The Workload UID is in the .metadata.uid field, which cannot be filtered on with kubectl --selector. For clusters with few workloads, grepping the output of kubectl get workloads -A would work, but that does not scale to clusters with thousands of workloads.

Completion requirements:

Make it easy to kubectl describe the Workload that preempted another workload. Simple solution: add the Workload UID to the Workload labels.

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • [x] Docs update -> add docs on how to get details of the preempting workload.

The artifacts should be linked in subsequent comments.
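
As a rough illustration of the simple solution above (a controller copying metadata.uid into a Workload label so the preempting workload can be fetched with a label selector), here is a minimal sketch. The label key kueue.x-k8s.io/workload-uid and the helper are hypothetical, not existing Kueue code:

package workloaduid

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// Hypothetical label key; Kueue does not define this today.
const workloadUIDLabel = "kueue.x-k8s.io/workload-uid"

// ensureUIDLabel copies the Workload's metadata.uid into a label so that a
// command like kubectl get workloads -A --selector=kueue.x-k8s.io/workload-uid=<UID>
// could locate the preempting workload without listing every workload.
func ensureUIDLabel(ctx context.Context, c client.Client, wl *kueue.Workload) error {
	if wl.Labels[workloadUIDLabel] == string(wl.UID) {
		return nil // already labeled, nothing to do
	}
	patch := client.MergeFrom(wl.DeepCopy())
	if wl.Labels == nil {
		wl.Labels = map[string]string{}
	}
	wl.Labels[workloadUIDLabel] = string(wl.UID)
	return c.Patch(ctx, wl, patch)
}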

@avrittrohwer avrittrohwer added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 22, 2025
@avrittrohwer avrittrohwer changed the title Make it easy to find the preempting workload Make it easy to describe the preempting workload Jan 22, 2025
@kannon92
Contributor

Have you looked at kueuectl? Maybe the UX would be better using the CLI?

@mimowo
Contributor

mimowo commented Jan 23, 2025

I don't remember if kueuectl would allow it, probably not. However, in any case, as an external binary the only way to find the preempting workload by UID would be to list all workloads first, which indeed does not scale.

I like the idea of adding the UID as a label. cc @tenzen-y WDYT?

@tenzen-y
Member

I don't remember if kueuectl would allow it, probably not. However, in any case, as an external binary the only way to find the preempting workload by UID would be to list all workloads first, which indeed does not scale.

I like the idea of adding the UID as a label. cc @tenzen-y WDYT?

Thank you for creating this issue. IMO, instead of a label with the UID, I would propose the CRD field selector: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4358-custom-resource-field-selectors

Have you checked if we can implement a custom field selector for Workload by the feature?

@mimowo
Contributor

mimowo commented Jan 27, 2025

Have you checked if we can implement a custom field selector for Workload by the feature?

I have not checked this myself. It might be worth trying as an alternative, but not sure this works with metadata.ownerReferences.uid. Actually, I see we are indexing the field already in Kueue, but this is in-memory only:

OwnerReferenceUID = "metadata.ownerReferences.uid"
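
For context, the index above is a controller-runtime field index: it is registered against the manager's cache and can only be used by clients backed by that cache, so it does not give kubectl users a field selector. A generic sketch of how such an index is registered and queried (not the exact Kueue code):

package indexsketch

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

const OwnerReferenceUID = "metadata.ownerReferences.uid"

// setupIndex registers the in-memory index with the manager's cache.
func setupIndex(ctx context.Context, mgr ctrl.Manager) error {
	return mgr.GetFieldIndexer().IndexField(ctx, &kueue.Workload{}, OwnerReferenceUID,
		func(o client.Object) []string {
			uids := make([]string, 0, len(o.GetOwnerReferences()))
			for _, ref := range o.GetOwnerReferences() {
				uids = append(uids, string(ref.UID))
			}
			return uids
		})
}

// workloadsOwnedBy lists Workloads through the cached client using the index above.
func workloadsOwnedBy(ctx context.Context, c client.Client, uid types.UID) (*kueue.WorkloadList, error) {
	var list kueue.WorkloadList
	err := c.List(ctx, &list, client.MatchingFields{OwnerReferenceUID: string(uid)})
	return &list, err
}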

@tenzen-y
Member

Have you checked if we can implement a custom field selector for Workload by the feature?

I have not checked this myself. It might be worth trying as an alternative, but not sure this works with metadata.ownerReferences.uid. Actually, I see we are indexing the field already in Kueue, but this is in-memory only:

kueue/pkg/controller/core/indexer/indexer.go

Line 40 in b9aa1c3

OwnerReferenceUID = "metadata.ownerReferences.uid"

I see. In that case, can we try confirming whether the custom CRD field selector works here? This can be confirmed easily, IMO.
If we confirm the CRD field selector is not feasible, we can add the owner UID as a label.

@mimowo
Contributor

mimowo commented Jan 28, 2025

SGTM, we can start by checking if https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4358-custom-resource-field-selectors#proposal can be used to filter the workloads first.

@avrittrohwer
Contributor Author

Looks like controller-gen has support for a +kubebuilder:selectablefield marker: kubernetes-sigs/controller-tools#1039

However, the Workload UID comes from k8s.io/apimachinery/pkg/apis/meta/v1.ObjectMeta (https://github.com/kubernetes-sigs/kueue/blob/main/apis/kueue/v1beta1/workload_types.go#L616), so I'm not sure how we could add that marker.
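
For reference, a sketch of where such a type-level marker would sit on the existing Workload type (assuming the marker syntax from kubernetes-sigs/controller-tools#1039 and the existing WorkloadSpec/WorkloadStatus types); it can point at spec or status fields, but there is no obvious way to express a path into the embedded ObjectMeta such as .metadata.uid:

package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:object:root=true
// +kubebuilder:selectablefield:JSONPath=`.spec.queueName`
type Workload struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WorkloadSpec   `json:"spec,omitempty"`
	Status WorkloadStatus `json:"status,omitempty"`
}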

@mimowo
Contributor

mimowo commented Jan 30, 2025

Actually, I think in this use-case we need to index metadata.uid which is a scalar, so it's worth checking if +kubebuilder:selectablefield can be used.

Also, I synced with @deads2k and the KEP only supports scalar fields, so it would not work for metadata.ownerReferences which is a list (but IIUC it is not necessary in our case).

@avrittrohwer
Contributor Author

Looked into the selectableFields some more. JSON paths into metadata are explicitly not allowed, from https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#selectablefield-v1-apiextensions-k8s-io:

Must not point to metadata fields.

This aligns with the KEP's validation rules section: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md#validation-rules

@tenzen-y
Member

tenzen-y commented Feb 7, 2025

https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md#validation-rules

I'm currently slightly confused since the KEP uses metadata fields as examples: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/4358-custom-resource-field-selectors/README.md#openapi-discovery

And they say "may not" in "selectableFields may not refer to metadata fields."
Have you seen any validation errors when you add a uid fieldSelector to any object?

@avrittrohwer
Contributor Author

@tenzen-y yes, I tested it out in main...avrittrohwer:kueue:log.

Validation error when applying the CRD to a 1.31 kind cluster:

The CustomResourceDefinition "workloads.kueue.x-k8s.io" is invalid: spec.selectableFields[0].jsonPath: Invalid value: ".metadata.UID": is an invalid path: does not refer to a valid field
make: *** [Makefile:235: install] Error 1

I tested with // +kubebuilder:selectablefield:JSONPath=`.spec.queueName` which worked so I think the issue is with the metadata json path

@tenzen-y
Member

@tenzen-y yes, I tested it out in main...avrittrohwer:kueue:log.

Validation error when applying the CRD to a 1.31 kind cluster:

The CustomResourceDefinition "workloads.kueue.x-k8s.io" is invalid: spec.selectableFields[0].jsonPath: Invalid value: ".metadata.UID": is an invalid path: does not refer to a valid field
make: *** [Makefile:235: install] Error 1

I tested with // +kubebuilder:selectablefield:JSONPath=.spec.queueName which worked so I think the issue is with the metadata json path

Thank you for confirming that! As far as I can check the implementation, it seems to be correct...
https://github.com/kubernetes/kubernetes/blob/15a186a888dc2e908681c876e321468b8d32a37b/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/validation/validation.go#L830-L832

Let's add an annotation if @mimowo does not have any objections.

@mimowo
Contributor

mimowo commented Feb 10, 2025

Let's add an annotation if @mimowo does not have any objections.

You mean the "UID" label? Sure we can do it, and mark it as immutable since UIDs are immutable.

@tenzen-y
Member

Let's add an annotation if @mimowo does not have any objections.

You mean the "UID" label? Sure we can do it, and mark it as immutable since UIDs are immutable.

Yeah, I meant UID label.

@avrittrohwer
Contributor Author

@mimowo @tenzen-y I took an initial pass at this in #4339. I haven't contributed to kubebuilder controllers before so let me know if this is the right general approach

@avrittrohwer
Contributor Author

Any thoughts on how we want to handle MultiKueue remote workloads? Should the UID label just get dropped when we make a remote copy?

@mimowo
Contributor

mimowo commented Mar 6, 2025

I see why the question: because the UID from the worker cluster (present in the condition message of the preempted workload) would not be found on the management cluster anyway.

I think ideally:

  1. we set the new UID label on the management workload unconditionally
  2. when creating workloads on the worker clusters we imprint the information about the UID from the management cluster (for example in a MainUID label; see the sketch below)
  3. when preemption happens on the worker cluster we put the value of the MainWorkloadUID label into the message, which can be used for searching for the workload from (1). Maybe the message could be kueue-admission Preempted to accommodate a workload (UID: <Main UID>, Worker UID: <Worker UID>) due to fair sharing within the cohort. Here Worker UID would be logged only for workloads with "MainUID".

WDYT?
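
A rough sketch of step 2 above; the label key matches the name suggested further down in this thread, but the helper and mechanics are illustrative assumptions, not the MultiKueue implementation:

package multikueuesketch

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// Proposed label carrying the management-cluster Workload UID on the remote copy.
const mainWorkloadUIDLabel = "kueue.x-k8s.io/multikueue-main-workload-uid"

// cloneForWorker builds the remote copy of a management-cluster Workload and
// stamps the management UID on it, so worker-side preemption messages could
// refer back to the original object.
func cloneForWorker(mgmt *kueue.Workload) *kueue.Workload {
	remote := mgmt.DeepCopy()
	remote.ResourceVersion = ""
	remote.UID = ""
	if remote.Labels == nil {
		remote.Labels = map[string]string{}
	}
	remote.Labels[mainWorkloadUIDLabel] = string(mgmt.UID)
	return remote
}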

@mimowo
Contributor

mimowo commented Mar 6, 2025

Thinking about it more, to be fair I would now just do 1. unconditionally, and defer the MultiKueue parts 2. and 3. as a future improvement.

@avrittrohwer
Contributor Author

It is not that much more effort to add a kueue.x-k8s.io/multikueue-main-workload-uid label when copying over the workload to the remote cluster. How about we take that step in this change and leave the preemption logging change as a future improvement?

@avrittrohwer
Contributor Author

Actually, in the name of keeping integration test changes more scoped in each PR, let's defer all the 'main-workload-uid' labeling on the remote objects to a follow-up change.

@avrittrohwer
Contributor Author

The naive approach of adding an update call in the Workload Reconcile loop is negatively impacting the scheduling perf tests: #4339 (comment). It seems like increasing the Kueue QPS and burst parameters could mitigate this.

Taking a step back, however, is there a good reason that in the kueue-admission Preempted to accommodate a workload (UID: <UID>) due to fair sharing within the cohort condition we only include the UID? We have access to the full Workload when we construct the condition: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/scheduler/preemption/preemption.go#L220. Why don't we just additionally log the preempting workload's namespace and name? Then users could very easily kubectl get the preempting workload without using a label selector.

@mimowo
Contributor

mimowo commented Mar 7, 2025

Why don't we just additionally log the preempting workload's namespace and name?

Logging full name and namespace is ok. We actually have a PR which does it in 0.11: 7dcff89. I think we could justify cherry-picking it to 0.10.

However, when creating the message for the condition we only include the UID to avoid leaking potentially sensitive information across namespaces, which could be present just in the workload name. The analogous approach is taken for preemption in core k8s (for events and conditions): https://github.com/kubernetes/kubernetes/blob/0e2a2afc4c7687e71f2dddbb7edfd03f082c6876/pkg/scheduler/framework/preemption/preemption.go#L198

@avrittrohwer
Contributor Author

Ah okay. Any ideas on an existing controller Workload update or patch call that we could use to also update the labels? I've run the scheduling perf tests a bunch and adding another update call is adding too much overhead to the workload controller loop

@avrittrohwer
Contributor Author

Maybe I could update the defaulting workload webhook to also run on updates and do the label defaulting there? It is not possible on creates because the workload object does not have UID yet
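
A minimal sketch of that idea, assuming the defaulting webhook is also registered for UPDATE operations and using the same hypothetical kueue.x-k8s.io/workload-uid label key as above; this is not the actual Kueue webhook code:

package webhooksketch

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

const workloadUIDLabel = "kueue.x-k8s.io/workload-uid" // hypothetical label key

type workloadDefaulter struct{}

var _ admission.CustomDefaulter = &workloadDefaulter{}

// Default stamps the UID label. On CREATE metadata.uid is still empty, so the
// label can only be defaulted on UPDATE requests.
func (d *workloadDefaulter) Default(ctx context.Context, obj runtime.Object) error {
	wl, ok := obj.(*kueue.Workload)
	if !ok {
		return fmt.Errorf("expected a Workload, got %T", obj)
	}
	if wl.UID == "" {
		return nil
	}
	if wl.Labels == nil {
		wl.Labels = map[string]string{}
	}
	wl.Labels[workloadUIDLabel] = string(wl.UID)
	return nil
}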

@mimowo
Contributor

mimowo commented Mar 7, 2025

adding another update call is adding too much overhead to the workload controller loop

How much of a hit is it? Maybe we could consider just swallowing the pill and bumping the thresholds.

Maybe I could update the defaulting workload webhook to also run on updates and do the label defaulting there?

This actually sounds like it could add even more overhead (but maybe it would not be visible in the perf tests; I cannot recall if they install webhooks).

@tenzen-y
Member

tenzen-y commented Mar 7, 2025

@mimowo @avrittrohwer We should not record preemptor name and namespace in events and logging since those are PII.
So, I think we should revert 7dcff89 and keep UID in logging in this enhancement.

You can find a similar discussion: #1942 (comment)

@mimowo
Contributor

mimowo commented Mar 7, 2025

It is not possible on creates because the workload object does not have UID yet

I know, I was hoping this would change with MAP, as then it all happens within the API server, but it seems it will not be visible in the CEL rules (though I haven't seen an explicit mention in the KEP).

@mimowo
Contributor

mimowo commented Mar 7, 2025

So, I think we should revert 7dcff89 and keep UID in logging in this enhancement.

I think there is no reason to revert; this PR does exactly that: it only adds to logging, but does not modify the message used in the condition. Also, we have integration tests showing the message format:

Message: fmt.Sprintf("Previously: Preempted to accommodate a workload (UID: %s) due to %s", alphaMidWl.UID, preemption.HumanReadablePreemptionReasons[kueue.InClusterQueueReason]),

@tenzen-y
Member

tenzen-y commented Mar 7, 2025

So, I think we should revert 7dcff89 and keep UID in logging in this enhancement.

I think there is no reason to revert; this PR does exactly that: it only adds to logging, but does not modify the message used in the condition. Also, we have integration tests showing the message format:

kueue/test/integration/singlecluster/scheduler/preemption_test.go

Line 300 in 099a239

Message: fmt.Sprintf("Previously: Preempted to accommodate a workload (UID: %s) due to %s", alphaMidWl.UID, preemption.HumanReadablePreemptionReasons[kueue.InClusterQueueReason]),

Ah, I see. It seems to focus only on logging. In that case, it should be fine. There are no leaks between tenants.

@tenzen-y
Member

tenzen-y commented Mar 7, 2025

Instead of a kueue.x-k8s.io/multikueue-main-workload-uid label, why can we not add the management or worker cluster workload UID as a field?
I mean, I am wondering if we can add dedicated fields, something like managementUID, to record the management cluster workload UID.
IIUC, the motivation for adding UID labeling is to make it easy to list workloads, right? In that case, we can add it to dedicated fields.

@mimowo
Contributor

mimowo commented Mar 7, 2025

Taking a step back - another option is that workloads already contain the owner Job UID: kueue.x-k8s.io/job-uid. So, maybe in the messages for preemption we also record the Job UID. Then, a user could use it for a fast fetch. This would mean no new requests, and no performance hit.

The only problem is with Pod Groups. However, maybe this is a corner case. Also, for a pod group we could record the UID of the "first" pod. This would already allow users to identify the preempting group.

@mimowo
Contributor

mimowo commented Mar 7, 2025

wdyt @tenzen-y @avrittrohwer about the Job UID approach for the fast search?

@avrittrohwer
Contributor Author

wdyt @tenzen-y @avrittrohwer about the Job UID approach for the fast search?

I don't think that would help, because we have the same problem of 'how do I get a job by its UID' which is not supported by kubernetes: kubernetes/kubernetes#20572

@mimowo
Contributor

mimowo commented Mar 7, 2025

We will search workloads by JobUID, as it is already recorded on the workload object:

wl.Labels[controllerconsts.JobUIDLabel] = jobUID

Leveraging the fact that at the moment of workload creation we already have the JobUID, but not the Workload UID.
So we only need to let the user know the JobUID (along with the Workload UID) in preemption.go.

@avrittrohwer
Contributor Author

Right, but once the user has JobUID, how do they kubectl get it by UID?

@avrittrohwer
Contributor Author

avrittrohwer commented Mar 7, 2025

Oh, they could do kubectl get workloads --selector=kueue.x-k8s.io/job-uid=<UID>. I was thinking about how they would do kubectl get <jobAPI>.

@avrittrohwer
Contributor Author

Adding the jobUID to the condition and event message is a much easier change. Let's do that.
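
For illustration, a hedged sketch of what including the Job UID in the preemption message could look like; the wording, function, and placement are assumptions, not necessarily what the linked PR implements:

package preemptionsketch

import (
	"fmt"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// Existing Kueue label recording the owning job's UID on the Workload.
const jobUIDLabel = "kueue.x-k8s.io/job-uid"

// preemptionMessage exposes both the Workload UID and the owning Job UID, so a
// user can run kubectl get workloads -A --selector=kueue.x-k8s.io/job-uid=<JobUID>.
func preemptionMessage(preemptor *kueue.Workload, reason string) string {
	msg := fmt.Sprintf("Preempted to accommodate a workload (UID: %s", preemptor.UID)
	if jobUID := preemptor.Labels[jobUIDLabel]; jobUID != "" {
		msg += fmt.Sprintf(", JobUID: %s", jobUID)
	}
	return msg + fmt.Sprintf(") due to %s", reason)
}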

@avrittrohwer avrittrohwer linked a pull request Mar 7, 2025 that will close this issue
@tenzen-y
Member

tenzen-y commented Mar 7, 2025

Taking a step back - another option is that workloads already contain the owner Job UID: kueue.x-k8s.io/job-uid. So, maybe in the messages for preemption we also record the Job UID. Then, a user could use it for a fast fetch. This would mean no new requests, and no performance hit.

The only problem is with Pod Groups. However, maybe this is a corner case. Also, for a pod group we could record the UID of the "first" pod. This would already allow users to identify the preempting group.

That sounds reasonable.
