-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set Meta of Restarted Job as Unused #687
Conversation
because whenever a new allocation is started (for the same runner), it is already handled as fresh in Poseidon. With this PR we persist this information to Nomad, to allow a minimal improved runner recovery behavior when such runners are still counted as idle.
ca61ebf
to
31299f4
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #687 +/- ##
==========================================
+ Coverage 76.17% 76.60% +0.42%
==========================================
Files 43 43
Lines 3631 3642 +11
==========================================
+ Hits 2766 2790 +24
+ Misses 632 623 -9
+ Partials 233 229 -4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for looking into this 🦋
I acknowledge your concerns and agree that there could be further potential to improve. However, it might not be worth the effort right now given the other issues we are looking at.
With this approach, we completely rely on the
PreviousAllocation
field.
That's true, and is fine for me as an improvement.
However, we disregard that idle runners can also be restarted, leading to an unnecessary request in this approach.
You mean because we are calling setRunnerMetaUsed(r, false, 0)
(and setting the value) without checking the previous ConfigMetaUsedKey
? Sure, that could be optimized, but shouldn't be too bad, I'd say.
Also, the
PreviousAllocation
might not be a reliable indicator for all restarted/rescheduled/migrated allocations. It might be possible that we do not reset the used field for some restarted/.. runners.
That's true. For the current knowledge, however, it is the best approach we have to update the meta data. I still feel that the PR improves the situation and if we learn about occasions where it is not sufficient, we can still fix the behavior.
As an alternative, we could invest more time in determining the restart/.. indicators. This might allow resetting the Used-Meta field of more runners, which would minimally improve the timing of recovery scenarios (startup, network outages, nomad outages).
Would you recommend that? I'd say that "minimally improving the timing of recovery scenarios" is nice, but currently not the highest priority.
to send a second request to Nomad only when the meta values change.
Yeah. How you summarized it, it seemed so easy that I've added it with b42819b.
No. I would agree 🦋 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks for the additional improvement -- this really looks great! 🌸
because whenever a new allocation is started (for the same runner), it is already handled as fresh in Poseidon. With this PR we persist this information to Nomad, to allow a minimally improved runner recovery behavior when such runners are still counted as idle.
Closes #621
We do not have sufficient knowledge about indicators in the allocation data to identify restarted/rescheduled/migrated allocations.
With this approach, we completely rely on the
PreviousAllocation
field. If it is set, the allocation must be restarted and is, therefore, unused. However, we disregard that idle runners can also be restarted, leading to an unnecessary request in this approach. Also, thePreviousAllocation
might not be a reliable indicator for all restarted/rescheduled/migrated allocations. It might be possible that we do not reset the used field for some restarted/.. runners.As an alternative, we could invest more time in determining the restart/.. indicators. This might allow resetting the Used-Meta field of more runners, which would minimally improve the timing of recovery scenarios (startup, network outages, nomad outages).