
fix: Prevent version skews when upgrading cluster #299

Merged · 3 commits · Jan 28, 2025

Conversation

ludwighansson

Description

Hello,

We have encountered issues when upgrading clusters where agent nodes are upgraded before the API servers, thus violating the Kubernetes kubelet versioning skew policy which states:

kubelet must not be newer than kube-apiserver.

Combined with the configuration option rke2_wait_for_all_pods_to_be_ready=true, this violation causes the role execution to get stuck and eventually fail, as all non-static pods go into a CreateContainerConfigError state with a kubelet event like:

Error: services have not yet been read at least once, cannot construct envvars

This change ensures that rolling restarts are performed on server nodes first, so that all API servers have been upgraded before any agents are restarted, preventing this versioning skew violation.

It is essentially a duplicate of the existing include_tasks block for rolling_restart.yml, with each block leveraging rke2_servers_group_name and rke2_agents_group_name to target server and agent nodes respectively.
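The split described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact diff: the task names, file paths, and surrounding structure in the role may differ, and the loop construct is assumed from the discussion below.

```yaml
# Hypothetical sketch of the server-first restart ordering (not the actual diff).
# rolling_restart.yml, rke2_servers_group_name and rke2_agents_group_name are
# names taken from the PR description; everything else is illustrative.
- name: Rolling restart of server nodes first (API servers upgrade first)
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_servers_group_name] }}"

- name: Rolling restart of agent nodes afterwards
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_agents_group_name] }}"
```

Restarting the server group in its own block before the agent block guarantees that kubelet on the agents is never newer than kube-apiserver, which is what the Kubernetes skew policy requires.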

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

This has been tested by upgrading Kubernetes from 1.30.3+rke2r1 to 1.31.4+rke2r1 in a 12-node cluster (3 servers, 9 agents), as well as in a smaller cluster with 1 server and 3 agent nodes. After patching the role, all servers were upgraded first, with agents following, resulting in a successful upgrade with rke2_wait_for_all_pods_to_be_ready=true.

Perform rolling restarts on server nodes first when upgrading,
to prevent version skews between the kube-apiserver running on
the server nodes, and kubelet on agent nodes.
@MonolithProjects MonolithProjects self-assigned this Jan 24, 2025
@MonolithProjects
Collaborator

Hi @ludwighansson, thanks for this PR. It looks good, but it should also take a single-node setup into account.

@MonolithProjects MonolithProjects added the enhancement New feature or request label Jan 24, 2025
@ludwighansson
Author

ludwighansson commented Jan 24, 2025

Ahh yes, I totally missed single node installations, sorry!

I will look into fixing that within the next few days!

Ludwig Hansson added 2 commits January 26, 2025 12:51
Default to an empty list on the with_items key when resolving hosts
in the agent group.
If a host belongs to both agent and server groups, ensure it isn't
restarted twice.
@ludwighansson
Author

I updated the with_items key to default to an empty list in case the agent group is undefined. It seems to pass my single node tests now!

btw, is it supported to have a host in both server and agent groups at the same time? (eg. for single node setup?)
I added a when condition on the agent block to ensure it isn't in the server group too, to prevent it from being restarted twice. What do you think?
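The two guards described in this comment could look roughly like the following. This is a sketch based on the discussion, not the merged diff; the task name and file path are assumptions.

```yaml
# Hypothetical sketch of the agent block with both fixes discussed:
# - default([]) so the play works when the agent group is undefined
#   (single-node, server-only inventories)
# - the when condition so a host in both groups is restarted only once,
#   during the earlier server pass
- name: Rolling restart of agent nodes
  ansible.builtin.include_tasks: rolling_restart.yml
  with_items: "{{ groups[rke2_agents_group_name] | default([]) }}"
  when: item not in groups[rke2_servers_group_name]
```

With an inventory that has no agent group, the loop simply iterates over an empty list and the block becomes a no-op, which is why the single-node tests pass.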

@MonolithProjects
Collaborator

@ludwighansson in a single-node setup the node is supposed to be in the server group. But I guess it's fine that you added the exception for the agent block (to be honest, I never tested how the role would behave if a single node were in both groups).

Collaborator

@MonolithProjects MonolithProjects left a comment


LGTM. Thanks

@MonolithProjects MonolithProjects merged commit e3cfef0 into lablabs:main Jan 28, 2025
5 checks passed
@MonolithProjects MonolithProjects added bug Something isn't working and removed enhancement New feature or request labels Jan 29, 2025