Describe the bug
We are observing high churn on the DestinationRules and Rollouts work queues when running Argo Rollouts v1.7.2. Even after the canaries are successfully promoted, the work queue depth remains high in some of the clusters, and the rate of workqueue_adds has not decreased, particularly for DestinationRules and Services.
Argo Rollouts is running with the flag --rollout-resync=60 to prevent rollouts from getting stuck due to missed events.
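For context, a quick way to confirm the flag is actually set on the controller (a sketch assuming the default argo-rollouts namespace and deployment name):
# Print the controller container args and check for --rollout-resync=60
kubectl -n argo-rollouts get deployment argo-rollouts \
  -o jsonpath='{.spec.template.spec.containers[0].args}'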
To Reproduce
Run Argo Rollouts v1.7.2 with the flag --rollout-resync=60
Create/update a large number of canary-enabled Rollout objects with Istio traffic shifting (we used 800 Rollout objects in total: 200 Rollout objects x 4 clusters); a minimal example is sketched after these steps.
Observe the behavior of work queues and the rate of workqueue_adds.
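For reference, a minimal sketch of a canary-enabled Rollout with Istio subset-level traffic shifting, roughly the shape used in the test; all names, images, weights and durations are placeholders, and the referenced Services, VirtualService and DestinationRule must already exist:
# Apply a placeholder canary Rollout that shifts traffic via Istio
cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.25
  strategy:
    canary:
      canaryService: demo-canary
      stableService: demo-stable
      trafficRouting:
        istio:
          virtualService:
            name: demo-vsvc
            routes:
            - primary
          destinationRule:
            name: demo-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
      - setWeight: 20
      - pause: {duration: 30s}
EOF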
Expected behavior
After successfully promoting canaries, the work queue depth should return to a normal state.
The rate of workqueue_adds should decrease instead of remaining constant.
The system shouldn't observe 2-4x the number of events per Rollout, DestinationRule and VirtualService.
Screenshots
rate(controller_clientset_k8s_request_total[1m])
Mutating requests were only sent for about 10 minutes, which is how long it took for all rollouts to auto-promote.
rate(workqueue_adds_total[1m])
Notice that the DR and Services rates remained the same before and after the test. Is the workqueue getting rate-limited?
The Rollouts rate actually somehow dropped below the baseline?
Also, it seems to have stabilized at about 2x the number of Rollout objects, though it is still steadily growing.
All of the queues except for Rollouts (which is steadily growing) are adding a constant amount of work every minute even though no further deployments are being made. This is most likely an undesirable side effect of --rollout-resync=60: with 200 Rollouts per cluster resyncing every 60 seconds, roughly 200 adds per minute per controller are expected on the Rollouts queue from resync alone.
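To make the comparison across queues easier, the add rate can be normalized per queue in Prometheus (a sketch; PROMETHEUS_URL is a placeholder, and it assumes the workqueue metrics carry the standard name label for the queue):
# Per-queue add rate over the last 5 minutes, via the Prometheus HTTP API
curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=sum by (name) (rate(workqueue_adds_total[5m]))'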
rate(workqueue_retries_total[1m])
Loads of retries due to the controller encountering conflicts when trying to persist objects
2x the number of DRs
3x the number of Rollouts
rate(workqueue_unfinished_work_seconds[1m])
The controller eventually managed to reduce the unfinished work for DRs (it took quite a while), but the Rollouts unfinished work remains roughly constant.
From the above:
I expect work to be added to the queue due to informers triggering a refresh, but this shouldn't be causing this many retries. Nothing is being changed, after all, as shown by controller_clientset_k8s_request_total, so what is it retrying?
It is clashing with something, but with what?
Is it clashing with itself? Is it clashing on field ownership with argocd-controller? Something else entirely?
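One way to narrow this down is to count conflict errors in the controller logs and inspect field ownership on an affected object (a sketch; it assumes conflicts surface with the standard apiserver conflict message, and <name> is a placeholder):
# Count optimistic-concurrency conflicts logged by the controller in the last hour
kubectl -n argo-rollouts logs deployment/argo-rollouts --since=1h \
  | grep -c "the object has been modified"
# See which field managers are writing to one of the affected DestinationRules
kubectl get destinationrule <name> -o yaml --show-managed-fields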
Version
v1.7.2
Logs
# Paste the logs from the rollout controller
# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts
# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
From the argo-rollouts logs, there was definitely a good number of conflicts during the stress test itself, but they dropped to 0 afterwards. Why are there still so many retries then?
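To attribute the remaining retries to a specific queue, the controller's own metrics endpoint can be scraped directly (a sketch; 8090 is the controller's default metrics port, adjust if it was changed):
# Port-forward to the controller and pull the workqueue series from /metrics
kubectl -n argo-rollouts port-forward deployment/argo-rollouts 8090:8090 &
sleep 2
curl -s localhost:8090/metrics | grep -E "workqueue_(adds_total|depth|retries_total)"
kill %1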
Informer write-backs could explain some of these clashes: fix: remove ReplicaSet write-back by zachaller · Pull Request #4044 · argoproj/argo-rollouts (fixed in 1.8.0; we will give it a try and repeat the test).
It seems like there is a 2x or 3x multiplication happening on DRs and Rollouts, possibly because the informers have secondary indexes that cause excess work. Potentially related to: argo-rollouts performance gradually degrades until controller restart #2855