
Consider extending context timer concept to both job scheduling and podready #187

Open
Mierdin opened this issue Jun 10, 2020 · 0 comments


Mierdin commented Jun 10, 2020

There are a few places in the antidote scheduler code where we need to wait for a set of parallel tasks to complete. In general this is done by spawning goroutines tracked by a waitgroup, and blocking until some exit criterion is met, such as the waitgroup being emptied or a timeout indicating failure.
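For anyone reading this without the code open, that pattern looks roughly like the following. This is a minimal sketch, not the actual scheduler code; `waitWithTimeout` and the durations are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout blocks until every goroutine tracked by wg has finished,
// or until the timeout elapses, and reports which of the two happened first.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true // all tasks completed
	case <-time.After(timeout):
		return false // gave up waiting
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			time.Sleep(time.Duration(id*100) * time.Millisecond) // stand-in for real work
		}(i)
	}
	if waitWithTimeout(&wg, 2*time.Second) {
		fmt.Println("all tasks finished")
	} else {
		fmt.Println("timed out waiting for tasks")
	}
}
```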

The pod network health checks use a context with a timeout, which seems to work pretty well and matches what we're trying to do.

However, the basic pod status check, as well as the wait for job success, uses a for loop with a counter, roughly like the sketch below.
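Something along these lines (a hand-written sketch with made-up names like `waitForPodReady`, not code lifted from the scheduler): poll up to `retries` times, sleeping `interval` between checks.

```go
package scheduler // illustrative package name

import (
	"fmt"
	"time"
)

// waitForPodReady polls isReady up to `retries` times, sleeping `interval`
// between attempts. The total wait is retries*interval plus however long
// each isReady call itself takes.
func waitForPodReady(isReady func() (bool, error), retries int, interval time.Duration) error {
	for i := 0; i < retries; i++ {
		ready, err := isReady()
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("pod not ready after %d attempts", retries)
}
```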

This seems to work - both instances already have reliable exit conditions and non-spammy spans - but it might be worth adopting a consistent approach across the board.
My main concern with the for-loop approach is that any sleeps we inject between checks compound with the cost of the checks themselves, making it hard to determine exactly how long a given portion of the code will wait. In theory, a check that retries 600 times and pauses for a second between tries should take roughly 600 seconds, but in practice it's usually significantly more than that. A context with a timeout lets us reason in terms of actual time spent waiting rather than number of retries, which I think makes more sense in the antidote config; a sketch of that version follows below.
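For comparison, the same wait expressed against a deadline rather than a retry count might look something like this (again an illustrative sketch; `waitForPodReadyCtx` and its parameters are invented here):

```go
package scheduler // illustrative package name

import (
	"context"
	"time"
)

// waitForPodReadyCtx polls isReady until it succeeds, errors, or the timeout
// elapses. The caller reasons in wall-clock time, and slow isReady calls no
// longer stretch the overall wait past the deadline.
func waitForPodReadyCtx(parent context.Context, isReady func() (bool, error), timeout, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		ready, err := isReady()
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		select {
		case <-ticker.C:
			// next poll
		case <-ctx.Done():
			return ctx.Err() // context.DeadlineExceeded once the timeout elapses
		}
	}
}
```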

Tbh, this is really not a huge priority - the status quo works well, and the answer may be "no, it's not worth changing" - but I think it's a worthwhile exercise to at least consider the implications, if anyone is curious to poke around the scheduler a bit.
