Investigate worker (Celery?) issues #2697
AFAICT it only affects short-running workers.
This matches the period of time when no tasks were being processed on that particular worker. There was no such message on the other worker yesterday, even though its "idle time" was very similar. 1799 seconds is almost exactly 30 minutes; that could be just a coincidence, but it seems a bit suspicious to me (although I suppose it could be the interval at which Celery checks for non-responsive workers).
I checked the images and there were no updates of
I was releasing Packit and had to tag both ogr and specfile in side tags. Ogr was quick:
Specfile took almost 30 minutes to react:
So we are hitting this on stage as well?
Yes, exactly!
These are the most suspicious messages I found in the logs for the short-running worker that didn't process any task in 30 minutes.
And this is the strange log for the long-running worker not responding to the heartbeat check.
Shouldn't `sync_from_downstream` be disabled?
I also double-checked other packages that could potentially affect this (namely celery, gevent, eventlet, greenlet, billiard, kombu, redis, amqp) and haven't found any difference in versions.
After discussion with @majamassarini I added an alert that could help detect this: https://gitlab.cee.redhat.com/osci/monitoring/-/merge_requests/557
While working on #2693 I spotted this for short-running pod 2, which I think is related to this card, and something similar for short-running pod 0 (see below). In both cases the alert rule was not raised (I think because we detect it only when this happens for all the workers at the same time, and in these cases it happened for just a single worker).
Worker 2 was restarted by OpenShift, while on worker 0 only Celery restarted. Short-running worker 0 log:
Thanks @majamassarini for the logs! I will take a closer look at the missed heartbeat issue.
Makes sense, I will try to adjust the alerting to account for this.
I added the alert for single short-running workers: https://gitlab.cee.redhat.com/osci/monitoring/-/merge_requests/561. This seems to work and detected one occurrence of a worker not receiving events a while ago (the affected worker was …). During that time, I tried to run a few diagnostic commands, from which it appears that the worker was healthy:
For comparison, here are the outputs of some of these commands run on a worker that was processing events normally:
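For context, these diagnostic checks go through Celery's inspect API; a minimal sketch of what that looks like (the app name and broker URL below are illustrative, not the actual deployment values):

```python
from celery import Celery

# Illustrative app name and broker URL; the real service wires these up differently.
app = Celery("packit_worker_example", broker="redis://valkey:6379/0")

insp = app.control.inspect()
print(insp.ping())      # ping/heartbeat reply from each worker
print(insp.active())    # tasks currently being executed per worker
print(insp.reserved())  # tasks prefetched by workers but not started yet
print(insp.stats())     # per-worker statistics (pool size, clock, totals, ...)
```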
A few minutes later, the same idle state occurred for short-running-2, and what's strange is that the graph indicates tasks were being constantly executed during this "idle" time. As for the 30-minute restart time, I think that comes from the valkey timeout setting here. Until we have something more, we could try lowering this timeout to help with earlier recovery. Additionally, I found the `--max-tasks-per-child` setting in the Celery documentation, which we could also experiment with as a workaround for potential memory leaks, related to #2522.
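For reference, a minimal sketch of how the `--max-tasks-per-child` idea could be applied through configuration instead of the CLI flag (the app name, broker URL, and the value 100 are assumptions for illustration):

```python
from celery import Celery

# Illustrative app name and broker URL.
app = Celery("packit_worker_example", broker="redis://valkey:6379/0")

# Recycle a pool worker after it has executed 100 tasks; this bounds the
# impact of slow memory leaks and is equivalent to passing
# --max-tasks-per-child=100 on the `celery worker` command line.
app.conf.worker_max_tasks_per_child = 100
```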
I see that the graph indicates there was a gap in task execution, and it was a full 30 minutes long, for short-running worker 1. I can't see anything for worker 2, so I am wondering whether we should increase the time range in the alert rule?
Judging from the posted graphs, we don't get over 8 concurrently running tasks per worker, but we have
About the … I do think we have a Celery issue, but it no longer seems related to memory or CPU limits. There is a strange correlation between CPU usage / Copr metrics and Celery misbehaving: it seems that when we enqueue many Copr builds, the CPU usage increases and Celery starts losing heartbeats. The other strange thing is that when we restart the pods, the number of enqueued Copr builds drops and the pods start behaving well again. One fun thing is that there is no correlation between queued Copr builds and started/ended Copr builds. There also seems to be one more correlation between processed events / CPU usage / pods misbehaving, but honestly I cannot explain it and I can't find any evidence in the logs. I would probably:
Not on this day, but I have seen 16 used sometimes. I think that having 4 workers somehow helps in not needing so much concurrency.
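For completeness, a sketch of how the concurrency knob itself is typically set, either on the `celery worker` command line via `--concurrency` or in the app configuration (the value below is only an example, not a recommendation):

```python
from celery import Celery

app = Celery("packit_worker_example")  # illustrative app name

# Number of pool processes/threads per worker; equivalent to
# `celery -A packit_worker_example worker --concurrency=8`.
app.conf.worker_concurrency = 8
```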
Previously, metrics could be pushed multiple times within a single process_message task. This could potentially lead to increased CPU usage, redundant network requests, and metric inconsistencies. Related to packit#2430 packit#2697
Push metrics once per process_message task
Previously, metrics could be pushed multiple times within a single process_message task. This could potentially lead to increased CPU usage, redundant network requests, and metric inconsistencies. Related to #2430 #2697
Reviewed-by: Matej Focko
Reviewed-by: Maja Massarini
Previously, `cls.pushgateway` was used in `process_message`, possibly causing race conditions in multithreaded Celery tasks, leading to inaccurate metric reporting. Related to packit#2430 packit#2697
I implemented #2722 and #2723, which seem to be fixing the metrics (#2430). As described in #2723, the very high values of queued Copr builds were caused by multiple tasks pushing metrics concurrently using the same SteveJobs instance in a multithreaded Celery environment. With concurrency set to 16, each thread could have pushed metrics multiple times, and when the load was higher (higher CPU), this could have caused e.g. network congestion and additional load (interacting with the pushgateway from multiple threads at the same time). I'm not yet sure whether this is the underlying issue with the "idle workers", so I will continue to monitor the behavior now that the pushgateway code is fixed.
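To illustrate the general pattern described above (a sketch, not the actual packit-service code), pushing once per task with a task-local registry avoids concurrent pushes from shared state; the metric name, job label, and gateway address here are assumptions:

```python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

def process_message(event: dict) -> None:
    # Task-local registry: no collector state is shared between
    # concurrently running Celery tasks/threads.
    registry = CollectorRegistry()
    processed = Counter(
        "processed_events_total",
        "Events processed by this task invocation",
        registry=registry,
    )

    # ... handle the event ...
    processed.inc()

    # Push exactly once, at the very end of the task.
    push_to_gateway("pushgateway:9091", job="packit-worker", registry=registry)
```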
After last week's redeployment (6th January), we have started hitting issues with job processing, causing tasks not to be processed for some time and delays:
Substantial drift from celery@packit-worker-long-running-0 may mean clocks are out of sync. Current drift is 1799 seconds. [orig: 2025-01-14 14:51:59.656603 recv: 2025-01-14 14:22:00.484181]
consumer: Connection to broker lost. Trying to re-establish the connection...
followed by a restart.