-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add reboot-required
annotation
#715
Conversation
3a15eec
to
272c48b
Compare
reboot-required
annotation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR. I added one smaller comment above.
I think it would make sense to only add/remove this annotation when the --annotate-nodes
flag is set. This would streamline the new annotation with the others.
939f18a
to
53e009a
Compare
Put the new annotation behind the annotation toggle. |
53e009a
to
c7b7d6a
Compare
@ckotzbauer @dholbach I pushed the changes you requested. |
@ckotzbauer @dholbach Can you please have a look? |
@timo-42 I have not forgotten, but I am a bit busy at the moment. |
@timo-42 Thank you for this feature. I am very interested in this. Does it work as expected? When I understand the code correctly, then annotateNodes is only executed during the reboot time window. Outside reboot time window:
But when it is in the reboot time window, reboot is required (and it is possible to acquire the lock) then the drain and reboots follows quickly:
After the reboot:
In my opinion, this defeats the purpose of your issue #702. I think for this to work, a separate goroutine would be needed like the maintainRebootRequiredMetric goroutine, which updates the kured_reboot_required metric accordingly outside of the reboot time window. Please let me know if it works for you as expected and if I have misunderstood something. FYI: I opened #725 to discuss the issue of the "restart required" log message outside the restart window. |
c7b7d6a
to
bacb687
Compare
You're right. I did not consider checking the state outside the window! Moved the code to a separate goroutine. |
I'm not a developer per se, but I think it would have advantages for the memory footprint to have only one goroutine. Especially since they partially execute the same commands. |
Signed-off-by: Timo Haas <[email protected]>
bacb687
to
11937fe
Compare
Moved code into metric goroutine. |
@vitality411 @ckotzbauer Is there anything more I should do? |
Hi @timo-42, from my side it looks good. |
@timo-42 Thank you for your patience during the review process, I'm back online and able to do this responsively going forward. As I understand it, what you'd like to accomplish here is to not schedule certain workloads on nodes that are running kured, because those nodes will responsively reboot, and cordon/drain is not sufficient to gracefully evict and reschedule those workloads. Is that correct? If so, I wonder if this issue would be a more elegant solution: #737 My observation: there is a problem with using a node annotation via the kured daemonset to identify "I am running the kured daemonset", because if for any reason that daemonset is removed manually, and/or the appropriate node taints are added so that it is no longer running kured, that annotation will have to be manually removed (and inevitably this will not happen 100% of the time, and so that annotation will not be 100% accurate in reflecting the state of that node). |
Yes. We want to not schedule new workload on it and want to be notified that a node needs a reboot, so we can start draining the pods in their respective maintenance windows.
If I understand this correctly, the proposal is to
If the concern is cleanup, than we could add a pre delete hook to delete all dangling annotations. |
@timo-42 Thanks for the explanation. Here's my thinking: by putting the annotation on the node as soon as kured launches its daemonset runtime you are essentially declaring: don't schedule any non-evictable workloads on this node because I may be rebooted at some point in the future. And then you would begin the process of evicting existing workloads, which may take days. But it's not clear to me how having this advance notice will actually guarantee that the underlying host OS will not put a It would seem to me that the workload requirements you're outlining would forbid kured from running at all on the nodes that are running those workloads. What is the downside of simply tainting such nodes and not having kured run on it. I'm not sure that kured is as it currently is a viable solution for these operational requirements. @dholbach @ckotzbauer @evrardjp any other thoughts? |
This is easy :) - I use |
@timo-42 Thanks for clarifying, that makes sense! In this case this would be my suggestion (curious for others' thoughts as well): Rather than add a third annotation to a node to describe kured reboot state, let's simply re-order the steps in the flow where we annotate If a reboot is required:
The above should solve for your use-case. You may use the @ckotzbauer @dholbach @evrardjp if they have any criticisms to the above |
I like the idea @jackfrancis. I'm not fully aware of the current implementation of |
@ckotzbauer In my view it wouldn't be a breaking change. The original implementation of It seems that the blocking pod selector criteria was overlooked when implementing the tl;dr I think there's a bug in the implementation of the @timo-42 does that make sense? I'd be happy to fix that bug myself. |
@jackfrancis This is exactly what we need 👍 . Thanks for the inside regarding the annotation and taking care of this bug yourself. I will than change our internal code to use |
Closing since #749 got merged. |
#702