
High Churn on DestinationRules and Rollouts Work Queues #4140

Open
nebojsa-prodana opened this issue Feb 19, 2025 · 2 comments
Labels: bug (Something isn't working)

Comments

@nebojsa-prodana

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

We are observing high churn on DestinationRules and Rollouts work queues when running Argo Rollouts v1.7.2. Even after successfully promoting canaries, the work queue depth remains high in some of the clusters. Additionally, the rate of workqueue_adds has not decreased, particularly for DestinationRules and Services.

Argo Rollouts is running with --rollout-resync=60 to prevent rollouts from getting stuck due to missed events.
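For reference, the resync interval is passed as a container argument to the controller; a quick way to confirm what the running controller was started with (this is a sketch assuming the default install, i.e. namespace argo-rollouts and deployment argo-rollouts):

# Confirm the arguments the controller is running with
# (assumes the default namespace and deployment name)
kubectl -n argo-rollouts get deployment argo-rollouts \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# The output should include --rollout-resync=60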

To Reproduce

  • Run Argo Rollouts v1.7.2 with the flag --rollout-resync=60
  • Create/update a large number of canary-enabled Rollout objects with Istio traffic shifting (we used 800 Rollout objects in total: 200 Rollout objects x 4 clusters). A sketch of such a Rollout follows this list.
  • Observe the behavior of work queues and the rate of workqueue_adds.
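For illustration, the Rollouts in the test look roughly like the sketch below; the names, image, weights, and pause durations are placeholders rather than the exact manifests we used, but the canary strategy with Istio virtualService/destinationRule traffic routing is the relevant part.

# Illustrative canary-enabled Rollout with Istio traffic shifting
# (all names, the image, and the steps are placeholders)
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: nginx:1.25
  strategy:
    canary:
      canaryService: example-canary
      stableService: example-stable
      trafficRouting:
        istio:
          virtualService:
            name: example-vsvc
            routes:
              - primary
          destinationRule:
            name: example-destrule
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
        - setWeight: 20
        - pause: {duration: 60}
        - setWeight: 50
        - pause: {duration: 60}
EOF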

Expected behavior

  • After successfully promoting canaries, the work queue depth should return to a normal state.
  • The rate of workqueue_adds should decrease instead of remaining constant.
  • The system shouldn't observe 2-4x the number of events per Rollout, DestinationRule, and VirtualService.

Screenshots

rate(controller_clientset_k8s_request_total[1m])


  • Mutating requests were only sent for about 10 minutes, which is how long it took for all rollouts to auto-promote.

rate(workqueue_adds_total[1m])

  • Notice that the DR and Services rate remained the same before and after the test. Is the workqueue getting rate limited?
  • The Rollouts rate actually somehow dropped to lower than baseline?
  • Also, it seems to have stabilized at about 2x the number of rollout objects, though it is steadily growing.
  • All the queues except Rollouts, which is steadily growing, are adding a constant amount of work every minute even though no further deployments are being made. Most likely an undesirable side-effect of --rollout-resync=60.
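One way to see the raw per-queue counters behind these charts, without Prometheus in the middle, is to scrape the controller's metrics endpoint directly; this is a sketch that assumes the default metrics port of 8090 and the default deployment name:

# Forward the controller's metrics port (8090 is the default; adjust if overridden)
kubectl -n argo-rollouts port-forward deployment/argo-rollouts 8090:8090 &

# Snapshot per-queue depth, adds, and retries
curl -s localhost:8090/metrics | grep -E '^workqueue_(depth|adds_total|retries_total)'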

rate(workqueue_retries_total[1m])

  • Loads of retries due to the controller encountering conflicts when trying to persist objects
  • 2x the number of DRs
  • 3x the number of Rollouts

rate(workqueue_unfinished_work_seconds[1m])

  • It eventually managed to reduce the unfinished work for DRs (it took quite a while), but the Rollouts unfinished work remains roughly constant.

From the above:

I expect work to be added to the queue by the informers' periodic resync, but that shouldn't be causing this many retries. Nothing is being changed, after all, as shown by controller_clientset_k8s_request_total, so what is it retrying?
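As a cross-check on whether the retries correspond to write conflicts at all, one approach is to grep the controller logs for the API server's standard conflict message; the exact wording of Argo Rollouts' own log lines may differ, so treat this as a sketch:

# Count occurrences of the standard Kubernetes conflict error in the controller logs
# (the apiserver returns "Operation cannot be fulfilled on ..." for 409 conflicts)
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep -c 'Operation cannot be fulfilled'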

Version

v1.7.2

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

nebojsa-prodana added the bug label on Feb 19, 2025
@nebojsa-prodana
Author

From the argo-rollouts logs, there was definitely a good number of conflicts during the stress test itself, but the count dropped to 0 afterwards. Why are there still so many retries, then?

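For reference, one way to chart this without log scraping might be the controller's own client metric, assuming controller_clientset_k8s_request_total carries the HTTP status code as a label (worth verifying against the exposed metric before relying on it):

# Look for 409 (conflict) responses recorded by the controller's clientset metric,
# reusing the port-forward to the metrics endpoint shown above
curl -s localhost:8090/metrics | grep 'controller_clientset_k8s_request_total' | grep '409'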

@nebojsa-prodana
Author

Possibly related to: #4073
