We noticed that completed jobs are not being cleaned up. We currently have the job-ttl argument set to 5m in our configuration. I believe the configuration sets the .spec.ttlSecondsAfterFinished value on the job, and it looks like that field only became generally available in Kubernetes 1.23. Unfortunately, we are not able to update Kubernetes that quickly. As a result, pods continue to pile up in our cluster, requiring us to either create a cron to clean them up or delete them manually.
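For reference, this is a minimal sketch of how I assume a configured job-ttl of 5m ends up on the Job object (the helper name and container image are placeholders, not the actual controller code):

```go
package main

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildJob is a hypothetical helper showing how a job-ttl setting (e.g. 5m)
// would translate into .spec.ttlSecondsAfterFinished. Kubernetes only honours
// this field once the TTL-after-finished controller is available, which is
// why it has no effect on older clusters.
func buildJob(name string, jobTTL time.Duration) *batchv1.Job {
	ttl := int32(jobTTL.Seconds()) // 5m -> 300
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "agent", Image: "buildkite/agent:latest"}, // placeholder image
					},
				},
			},
		},
	}
}
```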
An approach I've seen from the GitHub Actions Kubernetes runners is to have the controller watch for completed Jobs and clean them up itself.
I've attached an image from one of the namespaces running the agents, showing the pods continuing to exist past 5 minutes.
I think we will need to build a job cleanup function - aside from older k8s versions, there are other ways jobs can accumulate (e.g. they are created successfully but fail to start a pod for some reason, and sit around retrying forever).
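A rough sketch of what such a cleanup pass could look like with client-go (the function name and structure are illustrative only, not a proposal for the actual implementation):

```go
package main

import (
	"context"
	"log"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFinishedJobs deletes Jobs in the given namespace that finished
// (completed or failed) more than ttl ago. Background propagation ensures
// the Job's pods are garbage-collected along with it.
func cleanupFinishedJobs(ctx context.Context, client kubernetes.Interface, namespace string, ttl time.Duration) error {
	jobs, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	policy := metav1.DeletePropagationBackground
	for _, job := range jobs.Items {
		finished := job.Status.CompletionTime
		if finished == nil {
			// Failed jobs have no CompletionTime; fall back to the
			// transition time of the Failed condition.
			for _, cond := range job.Status.Conditions {
				if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
					t := cond.LastTransitionTime
					finished = &t
				}
			}
		}
		if finished == nil || time.Since(finished.Time) < ttl {
			continue // still running, or not yet past the TTL
		}
		err := client.BatchV1().Jobs(namespace).Delete(ctx, job.Name, metav1.DeleteOptions{
			PropagationPolicy: &policy,
		})
		if err != nil {
			log.Printf("failed to delete job %s: %v", job.Name, err)
		}
	}
	return nil
}
```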
@DrJosh9000 Upon further investigation, I think I found the source of the issue. We are using the default queue for our main build agent.
The monitor requires a queue tag to exist in the configuration, so we naturally added queue=default to the config file.
However, when the InformerFactory gets created, it sets up the informer to watch for jobs with all of the tags from the configuration.
Consequently, when a job comes in for the default queue, it's missing an explicit agent queue tag, so based on the scheduler code it doesn't get created with the tags.buildkite.com/queue label - and the informer never matches it, so it never gets cleaned up.
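To illustrate the pattern I believe is in play (a sketch only, not the actual agent-stack-k8s code): an informer factory restricted to the queue label will simply never list Jobs that were created without that label.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newJobInformerFactory illustrates the failure mode: if the factory's list
// options require tags.buildkite.com/queue=<queue>, any Job created without
// that label (as happens for the default queue) is invisible to the informer
// and therefore never cleaned up.
func newJobInformerFactory(client kubernetes.Interface, namespace, queue string) informers.SharedInformerFactory {
	selector := labels.Set{"tags.buildkite.com/queue": queue}.AsSelector().String()
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second, // resync period (arbitrary for this sketch)
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = selector
		}),
	)
}
```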