fix: Prevent version skews when upgrading cluster #299
Conversation
Perform rolling restarts on server nodes first when upgrading, to prevent version skew between the kube-apiserver running on the server nodes and the kubelet on agent nodes.
Hi @ludwighansson, thanks for this PR. It looks good, but it should also take into account setups with a single node.
Ahh yes, I totally missed single-node installations, sorry! I will look into fixing that within the next few days!
Default to an empty list on the `with_items` key when resolving hosts in the agent group.
If a host belongs to both agent and server groups, ensure it isn't restarted twice.
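A sketch of what these two commits describe, in the role's Ansible/Jinja idiom (the task name is illustrative and the exact expression is an assumption, not copied from the role):

```yaml
- name: Rolling restart of agent nodes
  ansible.builtin.include_tasks: rolling_restart.yml
  # default([]) keeps the loop valid when the agent group is absent
  # (e.g. a single-node install with only a server group);
  # difference() drops hosts that are also servers, so a host that is
  # in both groups is not restarted twice.
  with_items: "{{ groups[rke2_agents_group_name] | default([])
                  | difference(groups[rke2_servers_group_name] | default([])) }}"
```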
I updated the PR btw. Is it supported to have a host in both server and agent groups at the same time? (e.g. for a single-node setup?)
@ludwighansson in a single-node setup the node is supposed to be in the server group. But I guess it's fine that you added the exception for the agent block (to be honest, I never tried how the role would behave if the single node were in both groups).
LGTM. Thanks
Description
Hello,
We have encountered issues when upgrading clusters where agent nodes are upgraded before the API servers, violating the Kubernetes kubelet version skew policy, under which the kubelet must not be newer than kube-apiserver.

Combined with the configuration option `rke2_wait_for_all_pods_to_be_ready=true`, the role execution will get stuck and eventually fail, as all non-static pods go into a `CreateContainerConfigError` state with a corresponding event from kubelet.

This change ensures that rolling restarts are done on server nodes first, so that all API servers have been upgraded before any agents are restarted, preventing the version skew violation.
It is basically a duplicate of the existing `include_tasks` block for `rolling_restart.yml`, where each block leverages `rke2_servers_group_name` and `rke2_agents_group_name` to target server and agent nodes respectively.
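A minimal sketch of the resulting task layout, assuming the role loops `rolling_restart.yml` over each group with `with_items` (task names here are illustrative, not the role's actual code):

```yaml
# Servers first: all kube-apiserver instances are upgraded before any
# agent kubelet, which keeps the kubelet version skew policy intact.
- name: Rolling restart of server nodes
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_servers_group_name] }}"

# Agents second, excluding hosts that are also in the server group and
# were therefore already restarted above.
- name: Rolling restart of agent nodes
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_agents_group_name] | default([])
                  | difference(groups[rke2_servers_group_name]) }}"
```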
Type of change

How Has This Been Tested?
This has been tested by uplifting Kubernetes from 1.30.3+rke2r1 to 1.31.4+rke2r1 in a 12-node cluster (3 servers, 9 agents), as well as in a smaller cluster with 1 server and 3 agent nodes. After patching the role, all servers were upgraded first, with agents following after, resulting in a successful upgrade with `rke2_wait_for_all_pods_to_be_ready=true`.
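For reference, the inventory variables for such an upgrade run might look like the sketch below. `rke2_wait_for_all_pods_to_be_ready` is the option discussed above; `rke2_version` is assumed to be the role's version variable, and the exact value format should be checked against the role's defaults:

```yaml
# group_vars/all.yml -- illustrative values for the upgrade described above
rke2_version: v1.31.4+rke2r1              # target version for the rolling upgrade
rke2_wait_for_all_pods_to_be_ready: true  # wait for cluster pods between node restarts
```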