Auto Rescheduling Default Interval/Window Incorrect #893
Comments
adjust_check_scheduling() is intended to smooth out load by more evenly scheduling checks, but wasn't updated to work with the new heap based scheduling queue. These changes closely replicate the original implementation while using the new data structures reasonably efficiently, and providing sub-second resolution when calculating new event run times. The rescheduling algorithm makes some assumptions about per-check overhead that may be overly pessimistic, and possibly not needed to generate a smooth schedule. When rescheduled, the next run of an event may be earlier or later than dictated by its check interval, but will run at its regular check interval when no schedule adjustment is needed.
Previously we were looking at timed_event.run_time, which has only one-second precision. This caused rescheduling to run only when events occurred in the same second. By looking at squeue_event.when, we get the actual run times used by the event scheduling priority queue, with microsecond precision.
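To illustrate the idea described above, here is a minimal sketch of smoothing events across a window on a heap-based queue that stores fractional-second run times. It is not the Nagios Core implementation: the function name spread_evenly, the (when, name) tuple layout, and the even-spacing rule are assumptions made for illustration, and the per-check overhead estimate mentioned above is left out entirely.

```python
import heapq


def spread_evenly(queue, window_start, window_len):
    """Return a new heap in which events inside the window are evenly spaced.

    queue is a list of (when, name) tuples, where `when` is a float number
    of seconds. Hypothetical layout, not the Nagios squeue structures.
    """
    window_end = window_start + window_len
    # Events inside the window, in their current run order.
    inside = sorted(e for e in queue if window_start <= e[0] < window_end)
    # Events outside the window keep their original run times.
    outside = [e for e in queue if not (window_start <= e[0] < window_end)]
    if inside:
        # Sub-second resolution: the step is a float, not whole seconds.
        step = window_len / len(inside)
        inside = [
            (window_start + i * step, name)
            for i, (_, name) in enumerate(inside)
        ]
    new_queue = outside + inside
    heapq.heapify(new_queue)
    return new_queue


# Three checks bunched into the same second, plus one far in the future.
queue = [(100.2, "a"), (100.7, "b"), (100.9, "c"), (500.0, "d")]
smoothed = spread_evenly(queue, window_start=100.0, window_len=30.0)
order = [heapq.heappop(smoothed) for _ in range(4)]
```

With one-second precision the first three events would all appear to run at second 100 and be indistinguishable; with float timestamps each has a distinct position in the queue, and the sketch moves some of them later than their original times, as the description above says can happen.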
Thanks for reaching out. Unfortunately, the answer right now is that I'm not sure - the code you were referencing is from >=4 maintainers ago and I haven't gotten deep into check scheduling recently. In practice I've run some pretty large environments where this didn't seem to happen - even if nothing handles this case explicitly, there might be some implicit stuff in the auto-rescheduling where we "get lucky" and don't continuously procrastinate on checks. I have some vague ideas about why it might be fine but I'd rather read the code and give you a real answer instead of telling you some nonsense. If I may ask, what prompted you to dig into this? Did you see a check (or checks) in your environment that get continuously rescheduled?
Thanks for following up, Sebastian @sawolf! I got the idea that this could be a problem from the following page, which notes that there are potential problems with the default values. The new values it recommends don't seem to fully solve the problem, conceptually (see the section "The check is failing to be scheduled or executed"): https://nagios.force.com/support/s/article/Last-Check-Time-Not-Updating-4f7efc76
Hi @sawolf, have you had a chance to look into this further?
Hi @blevans33 - no, I haven't had a chance to get into this yet.
Hi @sawolf, this is a serious issue, and I have observed it as well. Please look into it ASAP.
It seems to me that the following two values should be equal in the default config file:
auto_rescheduling_interval
auto_rescheduling_window
Otherwise, if the window is larger than the interval, there is in theory nothing to stop a particular check from continuously getting rescheduled to the back of the window. (If the interval is 30 and the window is 45, only the checks rescheduled into the next 30 seconds are guaranteed to run during the upcoming interval, whereas the checks in the final 15 seconds of the window can be rescheduled AGAIN!)
Is there anything wrong with setting interval=30, window=30?
More info: https://support.nagios.com/forum/viewtopic.php?f=7&t=65475
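The arithmetic behind the question can be sketched as follows. This is a simplified model of the argument only, assuming that each rescheduling pass covers the next `window` seconds and that passes start every `interval` seconds. The function names are hypothetical, and the sketch does not show whether Nagios actually starves a check in practice, which this thread leaves unconfirmed.

```python
def reexposed_seconds(interval, window):
    """Seconds at the tail of a window that the next pass will cover again."""
    return max(0, window - interval)


def can_be_moved_again(scheduled_offset, interval, window):
    """True if a check placed `scheduled_offset` seconds into the current
    window is still pending when the next rescheduling pass starts."""
    return interval <= scheduled_offset < window


# interval=30, window=45: the last 15 seconds overlap with the next pass.
overlap_45 = reexposed_seconds(30, 45)
# interval=30, window=30: no overlap, every rescheduled check runs first.
overlap_30 = reexposed_seconds(30, 30)
# A check placed 40 seconds into a 45-second window is exposed again.
late_check = can_be_moved_again(40, 30, 45)
```

Under this model, setting the two values equal removes the overlap. Whether a window shorter or longer than the interval has other side effects is not something the sketch addresses.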