Add a way to force reboot a node #110
Comments
This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days). |
/reopen This is still a valid feature request |
I suppose this could be done with a more flexible reboot command. |
This is implemented now, and will be released with the next kured image (in the meantime, you can build your own container from our main branch if you prefer). |
@evrardjp Do you mind pointing me to the PR that fixes this issue? I see PRs for changing the command but this is about forcing a reboot even if we are not in the correct time range, there are k8s warnings, etc. |
I am sorry, I missed the part about it also being outside the maintenance window. Inside the maintenance window, we did implement a forced reboot that proceeds regardless of any drain/cordon failures. What you are looking for is not implemented, sorry. Could you clarify why you would want to force a reboot outside the maintenance window? Maybe there is an alternative: set up your maintenance window to be always open, but set up a blocker for when you don't want to reboot. Alternatively, if you want to force the reboot regardless of k8s success/failure, and you would need to write to /var/run/reboot-required anyway, I suppose you would connect to the host. Why not trigger a reboot there directly? |
One reason is mentioned in the PR description of #21. Personally, I would use it mainly when my host stops being able to schedule new pods. This happens if the kubelet starts OOM'ing because of a misconfiguration, or if the host gets temporarily blacklisted from ceph so that all the pods need a full restart (hard to do without rebooting the host, as you can't unmount existing volumes because they hang), etc. |
So, in those cases, you don't even need to drain/cordon, right? (While you can do something on the API, if kubelet is dead, it's kinda pointless). We don't have code for that (yet?). If you are still okay with trying to drain/cordon first, I think the forceReboot we have implemented is good enough: if drain fails, it ignores it and reboots anyway. The reboot is not scheduling new pods. |
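For readers following along, the "try to drain, but reboot anyway" behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: `drainThenReboot` and `errDrain` are hypothetical names made up for this example, not kured's actual code, and the drain and reboot steps are passed in as plain functions.

```go
package main

import (
	"errors"
	"fmt"
)

// errDrain is a made-up error standing in for any drain/cordon failure.
var errDrain = errors.New("kubelet unresponsive")

// drainThenReboot is a hypothetical helper, not kured's implementation.
// It always attempts the drain first. Without force, a drain error aborts
// the reboot; with force, the error is reported and the reboot proceeds.
func drainThenReboot(drain, reboot func() error, force bool) error {
	if err := drain(); err != nil {
		if !force {
			return fmt.Errorf("drain failed, not rebooting: %w", err)
		}
		fmt.Println("drain failed, rebooting anyway:", err)
	}
	return reboot()
}

func main() {
	failingDrain := func() error { return errDrain }
	reboot := func() error {
		fmt.Println("rebooting")
		return nil
	}

	fmt.Println("without force:", drainThenReboot(failingDrain, reboot, false))
	fmt.Println("with force:", drainThenReboot(failingDrain, reboot, true))
}
```

Note that this sketch only covers drain failures. It says nothing about the maintenance window or other blockers, which is the gap this issue is about.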
If you don't think it's right, don't hesitate to reopen this. |
If the failures are happening because the kubelet is OOM'ing, it might not be able to drain. For us, it would fail when the kubelet tried to read all of the disk stats simultaneously, which pushed us over the limit, so things like draining still work for a while after each OOM. I don't think force reboot fixes this, right? The issue is that you can't force kured to start the reboot process (draining, etc.). I think part of the confusion comes from this issue being called "force reboot" (it was named before there was a feature called "force reboot", which is not the same thing). The feature request is to add a way to bypass the checks you do for whether the node can be rebooted, i.e. if the host is having severe problems, you want to reboot even if it is outside the maintenance window or there are Prometheus alerts. Also, I can't reopen this; I don't have permission to. |
forcereboot ignores the drain errors, so yeah, that would (maybe) work. I am saying maybe because the OOM killer might want to kill kured, and there is nothing we can do if both kubelet and kured are killed. If it doesn't kill kured, then I don't see why it wouldn't work: the drain wouldn't stop on error, and the reboot would continue on its way. But all of that applies during the maintenance window, which is why I suggested having a large maintenance window (= always happy to reboot) plus a way to block using Prometheus when you aren't ready to reboot. You might also be interested in some new design here: #359 |
Yes, this is what this issue is about: adding a way to bypass the maintenance window when certain criteria are met. It seems like the rewrite for #359 could also include slightly more complex rule evaluation logic (or at least be structured in a way that allows it to be implemented later). e.g.:
|
Probably https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/ is a better way to do this. |
To do that, the script would have to implement locking like kured does, to make sure that two nodes aren't shut down at the same time. |
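To make the locking requirement concrete, here is a minimal sketch of the "only one node at a time" rule. It is an in-memory stand-in under stated assumptions: `rebootLock` and its methods are hypothetical names, and a real script would need to keep the holder in state shared across the cluster rather than in process memory.

```go
package main

import (
	"fmt"
	"sync"
)

// rebootLock is a hypothetical, in-memory stand-in for a cluster-wide lock.
// holder is the name of the node currently allowed to shut down, or "".
type rebootLock struct {
	mu     sync.Mutex
	holder string
}

// Acquire reports whether node holds the lock after the call.
// It succeeds if the lock is free or already held by the same node.
func (l *rebootLock) Acquire(node string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" || l.holder == node {
		l.holder = node
		return true
	}
	return false
}

// Release frees the lock, but only if node is the current holder.
func (l *rebootLock) Release(node string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == node {
		l.holder = ""
	}
}

func main() {
	lock := &rebootLock{}
	fmt.Println(lock.Acquire("node-a")) // node-a takes the lock
	fmt.Println(lock.Acquire("node-b")) // node-b has to wait
	lock.Release("node-a")              // node-a is back up
	fmt.Println(lock.Acquire("node-b")) // node-b can now proceed
}
```

The release check matters: a node that never held the lock must not be able to free it for someone else.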
I think this paves the way for a new "kured" :) |
Sometimes reboots need to be forced on a node. It would be nice if there was a way to force the node to restart outside of `start-time`/`end-time`, `blocking-pod-selector`, etc.
Maybe if `/var/run/reboot-required` contains the text `force`?
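As a rough illustration of this proposal (not something kured implements, per the discussion in the comments), the check could look like the sketch below. The names `readSentinel` and `shouldReboot` are hypothetical, and treating any occurrence of the word `force` in the file as the trigger is an assumption about how the sentinel content would be interpreted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readSentinel is a hypothetical helper. It reports whether the sentinel
// file exists and whether its content contains the word "force".
func readSentinel(path string) (required, forced bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return false, false // missing or unreadable: no reboot requested
	}
	return true, strings.Contains(string(data), "force")
}

// shouldReboot sketches the proposed rule: a forced sentinel bypasses the
// maintenance window and blockers, a normal sentinel respects both.
func shouldReboot(required, forced, inWindow, blocked bool) bool {
	if !required {
		return false
	}
	if forced {
		return true
	}
	return inWindow && !blocked
}

func main() {
	// Use a temporary directory instead of /var/run so the sketch is safe to run.
	dir, err := os.MkdirTemp("", "sentinel")
	if err != nil {
		fmt.Println("could not create temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "reboot-required")
	if err := os.WriteFile(path, []byte("force\n"), 0o644); err != nil {
		fmt.Println("could not write sentinel:", err)
		return
	}

	required, forced := readSentinel(path)
	// Outside the window and with a blocker active, only "force" gets through.
	fmt.Println("reboot now:", shouldReboot(required, forced, false, true))
}
```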