
High Churn on DestinationRules and Rollouts Work Queues #4140

Open
nebojsa-prodana opened this issue Feb 19, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

@nebojsa-prodana

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

We are observing high churn on DestinationRules and Rollouts work queues when running Argo Rollouts v1.7.2. Even after successfully promoting canaries, the work queue depth remains high in some of the clusters. Additionally, the rate of workqueue_adds has not decreased, particularly for DestinationRules and Services.

Argo Rollouts is running with --rollout-resync=60 to prevent rollouts from getting stuck due to missed events.
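For reference, the resync interval is passed as a container argument to the controller; a quick way to confirm what the running controller was started with (this is a sketch assuming the default install, i.e. namespace argo-rollouts and deployment argo-rollouts):

# Confirm the arguments the controller is running with
# (assumes the default namespace and deployment name)
kubectl -n argo-rollouts get deployment argo-rollouts \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# The output should include --rollout-resync=60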

To Reproduce

  • Run Argo Rollouts v1.7.2 with the flag --rollout-resync=60
  • Create/update a large number of canary-enabled Rollout objects with Istio traffic shifting (we used 800 Rollout objects in total: 200 Rollout objects x 4 clusters). A sketch of such a Rollout follows this list.
  • Observe the behavior of work queues and the rate of workqueue_adds.
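For illustration, the Rollouts in the test look roughly like the sketch below; the names, image, weights, and pause durations are placeholders rather than the exact manifests we used, but the canary strategy with Istio virtualService/destinationRule traffic routing is the relevant part.

# Illustrative canary-enabled Rollout with Istio traffic shifting
# (all names, the image, and the steps are placeholders)
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: nginx:1.25
  strategy:
    canary:
      canaryService: example-canary
      stableService: example-stable
      trafficRouting:
        istio:
          virtualService:
            name: example-vsvc
            routes:
              - primary
          destinationRule:
            name: example-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 20
        - pause: {duration: 60}
        - setWeight: 50
        - pause: {duration: 60}
EOF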

Expected behavior

  • After successfully promoting canaries, the work queue depth should return to a normal state.
  • The rate of workqueue_adds should decrease instead of remaining constant.
  • The system shouldn't observe 2-4x the number of events per Rollout, DestinationRule, and VirtualService.

Screenshots

rate(controller_clientset_k8s_request_total[1m])


  • Mutating requests were only sent for about 10 minutes, which is how long it took for all rollouts to auto-promote.

rate(workqueue_adds_total[1m])

  • Notice that the DR and Services rate remained the same before and after the test. Is the workqueue getting rate limited?
  • The Rollouts rate actually somehow dropped to lower than baseline?
  • Also, it seems to have stabilized at about 2x the number of rollout objects, though it is steadily growing.
  • All the queues except Rollouts, which is steadily growing, are adding a constant amount of work every minute even though no further deployments are being made. Most likely an undesirable side-effect of --rollout-resync=60.
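One way to see the raw per-queue counters behind these charts, without Prometheus in the middle, is to scrape the controller's metrics endpoint directly; this is a sketch that assumes the default metrics port of 8090 and the default deployment name:

# Forward the controller's metrics port (8090 is the default; adjust if overridden)
kubectl -n argo-rollouts port-forward deployment/argo-rollouts 8090:8090 &

# Snapshot per-queue depth, adds, and retries
curl -s localhost:8090/metrics | grep -E '^workqueue_(depth|adds_total|retries_total)'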

rate(workqueue_retries_total[1m])

  • Loads of retries due to the controller encountering conflicts when trying to persist objects
  • 2x the number of DRs
  • 3x the number of Rollouts

rate(workqueue_unfinished_work_seconds[1m])

  • It eventually managed to reduce the unfinished work for DRs (it took quite a while), but the Rollouts unfinished work remains roughly constant.

From the above:

I expect work to be added to the queue by the informers' periodic resync, but that shouldn't be causing this many retries. Nothing is being changed, after all, as shown by controller_clientset_k8s_request_total, so what is it retrying?
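As a cross-check on whether the retries correspond to write conflicts at all, one approach is to grep the controller logs for the API server's standard conflict message; the exact wording of Argo Rollouts' own log lines may differ, so treat this as a sketch:

# Count occurrences of the standard Kubernetes conflict error in the controller logs
# (the apiserver returns "Operation cannot be fulfilled on ..." for 409 conflicts)
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep -c 'Operation cannot be fulfilled'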

Version

v1.7.2

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

nebojsa-prodana added the bug label on Feb 19, 2025
@nebojsa-prodana
Author

From the argo-rollouts logs, there was definitely a good number of conflicts during the stress test itself, but the count dropped to 0 afterwards. Why are there still so many retries, then?

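For reference, one way to chart this without log scraping might be the controller's own client metric, assuming controller_clientset_k8s_request_total carries the HTTP status code as a label (worth verifying against the exposed metric before relying on it):

# Look for 409 (conflict) responses recorded by the controller's clientset metric,
# reusing the port-forward to the metrics endpoint shown above
curl -s localhost:8090/metrics | grep 'controller_clientset_k8s_request_total' | grep '409'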

@nebojsa-prodana
Author

Possibly related to: #4073
