
fix: Prevent version skews when upgrading cluster #299

Merged · 3 commits · Jan 28, 2025

Conversation

ludwighansson

Description

Hello,

We have encountered issues when upgrading clusters where agent nodes are upgraded before the API servers, thus violating the Kubernetes kubelet versioning skew policy which states:

kubelet must not be newer than kube-apiserver.

Combined with the configuration option rke2_wait_for_all_pods_to_be_ready=true, this violation causes the role execution to get stuck and eventually fail, as all non-static pods go into a CreateContainerConfigError state with a kubelet event like:

Error: services have not yet been read at least once, cannot construct envvars

This change ensures that rolling restarts are performed on server nodes first, so that all API servers have been upgraded before any agents are restarted, preventing this versioning skew violation.

It is essentially a duplicate of the existing include_tasks block for rolling_restart.yml, with each block leveraging rke2_servers_group_name and rke2_agents_group_name to target server and agent nodes respectively.
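The split described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact diff: the task names, file paths, and surrounding structure in the role may differ, and the loop construct is assumed from the discussion below.

```yaml
# Hypothetical sketch of the server-first restart ordering (not the actual diff).
# rolling_restart.yml, rke2_servers_group_name and rke2_agents_group_name are
# names taken from the PR description; everything else is illustrative.
- name: Rolling restart of server nodes first (API servers upgrade first)
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_servers_group_name] }}"

- name: Rolling restart of agent nodes afterwards
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_agents_group_name] }}"
```

Restarting the server group in its own block before the agent block guarantees that kubelet on the agents is never newer than kube-apiserver, which is what the Kubernetes skew policy requires.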

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

This has been tested by upgrading Kubernetes from 1.30.3+rke2r1 to 1.31.4+rke2r1 in a 12-node cluster (3 servers, 9 agents), as well as in a smaller cluster with 1 server and 3 agent nodes. After patching the role, all servers were upgraded first, with agents following, resulting in a successful upgrade with rke2_wait_for_all_pods_to_be_ready=true.

Perform rolling restarts on server nodes first when upgrading,
to prevent version skews between the kube-apiserver running on
the server nodes, and kubelet on agent nodes.
@MonolithProjects MonolithProjects self-assigned this Jan 24, 2025
@MonolithProjects
Collaborator

Hi @ludwighansson, thanks for this PR. It looks good, but it should also take a single-node setup into account.

@MonolithProjects MonolithProjects added the enhancement New feature or request label Jan 24, 2025
@ludwighansson
Author

ludwighansson commented Jan 24, 2025

Ahh yes, I totally missed single node installations, sorry!

I will look into fixing that within the next few days!

Ludwig Hansson added 2 commits January 26, 2025 12:51
Default to an empty list on the with_items key when resolving hosts
in the agent group.
If a host belongs to both agent and server groups, ensure it isn't
restarted twice.
@ludwighansson
Author

I updated the with_items key to default to an empty list in case the agent group is undefined. It seems to pass my single node tests now!

btw, is it supported to have a host in both server and agent groups at the same time? (eg. for single node setup?)
I added a when condition on the agent block to ensure it isn't in the server group too, to prevent it from being restarted twice. What do you think?
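The two guards described in this comment could look roughly like the following. This is a sketch based on the discussion, not the merged diff; the task name and file path are assumptions.

```yaml
# Hypothetical sketch of the agent block with both fixes discussed:
# - default([]) so the play works when the agent group is undefined
#   (single-node, server-only inventories)
# - the when condition so a host in both groups is restarted only once,
#   during the earlier server pass
- name: Rolling restart of agent nodes
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_agents_group_name] | default([]) }}"
  when: item not in groups[rke2_servers_group_name]
```

With an inventory that has no agent group, the loop simply iterates over an empty list and the block becomes a no-op, which is why the single-node tests pass.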

@MonolithProjects
Collaborator

@ludwighansson in a single-node setup the node is supposed to be in the server group. But I guess it's fine that you added the exception for the agent block (to be honest, I never tested how the role would behave if a single node were in both groups).

Collaborator

@MonolithProjects MonolithProjects left a comment


LGTM. Thanks

@MonolithProjects MonolithProjects merged commit e3cfef0 into lablabs:main Jan 28, 2025
5 checks passed
@MonolithProjects MonolithProjects added bug Something isn't working and removed enhancement New feature or request labels Jan 29, 2025