We noticed that completed jobs are not being cleaned up. We currently have the job-ttl argument set to 5m in our configuration. I believe the configuration sets the .spec.ttlSecondsAfterFinished value on the job, and it looks like that field only became generally available in Kubernetes 1.23. Unfortunately, we are not able to update Kubernetes that quickly. As a result, pods continue to pile up in our cluster, requiring us to either create a cron to clean them up or delete them manually.
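For reference, this is a minimal sketch of how I assume a configured job-ttl of 5m ends up on the Job object (the helper name and container image are placeholders, not the actual controller code):

```go
package main

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildJob is a hypothetical helper showing how a job-ttl setting (e.g. 5m)
// would translate into .spec.ttlSecondsAfterFinished. Kubernetes only honours
// this field once the TTL-after-finished controller is available, which is
// why it has no effect on older clusters.
func buildJob(name string, jobTTL time.Duration) *batchv1.Job {
	ttl := int32(jobTTL.Seconds()) // 5m -> 300
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "agent", Image: "buildkite/agent:latest"}, // placeholder image
					},
				},
			},
		},
	}
}
```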
An approach I've seen from the GitHub Actions Kubernetes runners is to have the controller watch for completed Jobs and clean them up itself.
I've attached an image from one of the namespaces running the agents, showing the pods continuing to exist past 5 minutes.
I think we will need to build a job cleanup function - aside from older k8s versions, there are other ways jobs can accumulate (e.g. they are created successfully but fail to start a pod for some reason, and sit around retrying forever).
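A rough sketch of what such a cleanup pass could look like with client-go (the function name and structure are illustrative only, not a proposal for the actual implementation):

```go
package main

import (
	"context"
	"log"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFinishedJobs deletes Jobs in the given namespace that finished
// (completed or failed) more than ttl ago. Background propagation ensures
// the Job's pods are garbage-collected along with it.
func cleanupFinishedJobs(ctx context.Context, client kubernetes.Interface, namespace string, ttl time.Duration) error {
	jobs, err := client.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	policy := metav1.DeletePropagationBackground
	for _, job := range jobs.Items {
		finished := job.Status.CompletionTime
		if finished == nil {
			// Failed jobs have no CompletionTime; fall back to the
			// transition time of the Failed condition.
			for _, cond := range job.Status.Conditions {
				if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
					t := cond.LastTransitionTime
					finished = &t
				}
			}
		}
		if finished == nil || time.Since(finished.Time) < ttl {
			continue // still running, or not yet past the TTL
		}
		err := client.BatchV1().Jobs(namespace).Delete(ctx, job.Name, metav1.DeleteOptions{
			PropagationPolicy: &policy,
		})
		if err != nil {
			log.Printf("failed to delete job %s: %v", job.Name, err)
		}
	}
	return nil
}
```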
@DrJosh9000 Upon further investigation, I think I found the source of the issue. We are using the default queue for our main build agent.
The monitor requires a queue tag to exist in the configuration, so we naturally added queue=default to the config file.
However, when the InformerFactory gets created, it sets up the informer to watch for jobs with all of the tags from the configuration.
Consequently, when a job comes in for the default queue, it's missing an explicit agent queue tag, so based on the scheduler code it doesn't get created with the tags.buildkite.com/queue label - and the informer never matches it, so it never gets cleaned up.
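To illustrate the pattern I believe is in play (a sketch only, not the actual agent-stack-k8s code): an informer factory restricted to the queue label will simply never list Jobs that were created without that label.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newJobInformerFactory illustrates the failure mode: if the factory's list
// options require tags.buildkite.com/queue=<queue>, any Job created without
// that label (as happens for the default queue) is invisible to the informer
// and therefore never cleaned up.
func newJobInformerFactory(client kubernetes.Interface, namespace, queue string) informers.SharedInformerFactory {
	selector := labels.Set{"tags.buildkite.com/queue": queue}.AsSelector().String()
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second, // resync period (arbitrary for this sketch)
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = selector
		}),
	)
}
```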