You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Dec 3, 2021. It is now read-only.
There are a few places in the antidote scheduler code where we need to wait for certain, parallel tasks to complete. In general this is done by waiting by spawning goroutines with a waitgroup, and blocking until some kind of exit criteria, like the waitgroup being emptied, or some kind of timeout indicating failure.
The pod network health checks use a context with timeout that seems to work pretty well, and matches what we're trying to do.
This seems to work, as these two instances already seem to have reliable exit conditions and non-spammy spans, but it might be worth considering adopting a consistent approach across the board.
My main concern here is that with the for loop approach, any sleeps we inject in the loop between checks will compound, making it really hard to determine exactly how long a given portion of the code will wait for. In theory, a check which retries 600 times and pauses for a second in between tries will be roughly 600 seconds, but in practice it's usually significantly more than that. The context with timeout will allow us to reason in terms of actual time waiting, rather than number of retries, which I think makes more sense in the antidote config.
Tbh, this is really not a huge priority - the status quo works well, and the answer may be "no it's not worth changing", but I think it's a worthwhile exercise to at least consider the implications if anyone is curious to poke around the scheduler a bit.
The text was updated successfully, but these errors were encountered:
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
There are a few places in the antidote scheduler code where we need to wait for certain, parallel tasks to complete. In general this is done by waiting by spawning goroutines with a waitgroup, and blocking until some kind of exit criteria, like the waitgroup being emptied, or some kind of timeout indicating failure.
The pod network health checks use a context with timeout that seems to work pretty well, and matches what we're trying to do.
However, the basic pod status as well as the wait for job success uses a for loop with a counter.
This seems to work, as these two instances already seem to have reliable exit conditions and non-spammy spans, but it might be worth considering adopting a consistent approach across the board.
My main concern here is that with the for loop approach, any sleeps we inject in the loop between checks will compound, making it really hard to determine exactly how long a given portion of the code will wait for. In theory, a check which retries 600 times and pauses for a second in between tries will be roughly 600 seconds, but in practice it's usually significantly more than that. The context with timeout will allow us to reason in terms of actual time waiting, rather than number of retries, which I think makes more sense in the antidote config.
Tbh, this is really not a huge priority - the status quo works well, and the answer may be "no it's not worth changing", but I think it's a worthwhile exercise to at least consider the implications if anyone is curious to poke around the scheduler a bit.
The text was updated successfully, but these errors were encountered: