Empty traefik/provider.yaml on worker nodes #4197
Comments
I have exactly the same issue.
Same for me with a fresh cluster (v1.28.1) with 3 master and 2 worker nodes. One of the workers had an empty traefik/provider.yaml after joining the cluster.
I only have one worker node. ...traefik/providers.yaml is empty. I'm not entirely sure how this file is supposed to look. Ah, I see it, you provided one: /var/snap/microk8s/current/args/traefik/provider-template.yaml
Workaround: just copy this over into the "provider.yaml" file in the same directory, then run
which should apply the config. Worked for me :P Per documentation: https://microk8s.io/docs/configuring-services
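For reference, a minimal sketch of that workaround on a worker node, assuming the stock snap paths mentioned above; the restart step uses snap restart microk8s, which is what the issue summary and later comments use, so treat the exact restart command as an assumption for your setup:

```bash
# Sketch of the workaround described above (assumes a default snap install).
ARGS_DIR=/var/snap/microk8s/current/args/traefik

# Restore provider.yaml from the template shipped alongside it.
sudo cp "$ARGS_DIR/provider-template.yaml" "$ARGS_DIR/provider.yaml"

# If the template still contains a placeholder for the API server address,
# edit provider.yaml so it points at a reachable control-plane endpoint
# (see the comment below about APISERVER endpoints).

# Restart microk8s on the worker so the restored config is picked up.
sudo snap restart microk8s
```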
Similar issue: the workers were suddenly reported as NotReady when I restarted them, v1.28.1. It turns out provider.yaml is empty, but that's a bit late for me, as I had already re-joined one of the worker nodes with a fresh microk8s snap package. Seems like a serious issue for a production environment.
Same situation during an upgrade from 1.27.6 to 1.28.2 on a worker node, after the upgrade of the dqlite nodes.
Same here with a fresh installation of microk8s 1.28.3/stable. It reappears randomly and only occurs on some worker nodes, while other worker nodes work fine. I'll keep observing.
I have the same situation, and updating the "provider.yaml" with APISERVER endpoints did not fix the issue. I have a six-node HA cluster. When any of the datastore nodes is shut down, the other nodes move to NotReady status; this happens occasionally (the watch kubectl get nodes output is omitted here). It takes 15 to 20 minutes to recover automatically, and the applications are not accessible during this window.
systemctl stop snap.microk8s.daemon-k8s-dqlite; sleep 2; systemctl start snap.microk8s.daemon-k8s-dqlite
Any fix for this issue?
I just want to bump this and keep this issue alive. I've had to manually reapply that providers.yaml on one worker node over a dozen times now over the last 7-8 months. If anyone has any idea of the root cause, such as whether something else is causing this, that would be great to know.
FWIW, this seems to happen occasionally on worker nodes when dqlite nodes reboot. I found the code which manipulates that file.
OK, I think I fixed this. [Update 5-9-24] Ugh, no, I didn't fix it; it's still as broken as ever. As a workaround, I've built a cronjob that checks that file every 5 minutes and, if the provider.yaml file disappears, puts a fresh copy in place and restarts microk8s on the worker node.
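A hedged sketch of such a watchdog, assuming a known-good copy of provider.yaml has been saved somewhere (the /root/provider.yaml.good path is hypothetical) and that snap restart microk8s is an acceptable way to bounce the worker; adjust both to your environment:

```bash
#!/usr/bin/env bash
# check-provider.sh: restore provider.yaml and restart microk8s if the file is
# missing or empty. Paths and the restart command are assumptions, not official.
set -eu

PROVIDER=/var/snap/microk8s/current/args/traefik/provider.yaml
GOOD_COPY=/root/provider.yaml.good   # hypothetical location of a known-good copy

# -s is true only if the file exists and is non-empty.
if [ ! -s "$PROVIDER" ]; then
    echo "$(date -Is) provider.yaml missing or empty, restoring" >> /var/log/check-provider.log
    cp "$GOOD_COPY" "$PROVIDER"
    snap restart microk8s
fi
```

Installed from root's crontab roughly as */5 * * * * /usr/local/bin/check-provider.sh, matching the five-minute interval mentioned above; the script path is hypothetical.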
Hi, still got the same problem on v1.29.7.
Little update: we had a network outage, and our nodes went through a 15-20 minute NotReady/Ready cycle every time one of the control planes dropped. Changing /var/snap/microk8s/current/args/ha-conf to a single failure-domain for all control planes seems to fix the issue (originally there were two zones, since we have two nearby data centers). Now only the node that dropped due to the network outage is NotReady, and the workers did not change status at all. If I bring back the second failure-domain, the problems start again. I'm not completely sure what's going on, but this issue may be related to failure-domains.
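A minimal sketch of that change, assuming the failure-domain=&lt;integer&gt; key format described in the microk8s failure-domain documentation; whether a full snap restart is required afterwards is an assumption:

```bash
# Run on EVERY control-plane node; the point is that they all share one value.
# Key format follows the microk8s failure-domain docs; verify for your version.
echo "failure-domain=1" | sudo tee /var/snap/microk8s/current/args/ha-conf

# Assumption: restart so dqlite picks up the new failure domain.
sudo snap restart microk8s
```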
Confirming the same issue here. It looks like traefik/provider.yaml is losing its contents.
I have just had a case of this as well. The logs at the time were absolutely quiet, and all there was is this:
I've tried looking at the code, but so far I could not find the place that is responsible for this.
Woke up this morning to the same empty file on an HA cluster, 3 control planes and 7 workers. Same logs as observed before.
I do, however, observe this log line on one of my control planes:
Thank you all for your time and effort in bringing attention to this issue and for sharing your workarounds. I sincerely apologize for the inconvenience and the fact that this issue is still not resolved. We are currently working on identifying a solution, and we will provide updates as soon as we have more information.
I suspect this line to be responsible for this issue:
canonical/microk8s-cluster-agent@b83af6d
Hopefully with the proposed fix we won't see this issue anymore. The change will soon be promoted to stable.
Thank you @HomayoonAlimohammadi! By the way, how can one identify the updates that are released on the stable channel?
Hi @sbidoul! Sorry that I completely missed your question. |
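For what it's worth (the maintainer's actual answer is not quoted here, so this is only a general suggestion), standard snap tooling can show what each channel currently serves and what has been refreshed locally:

```bash
# Show the version/revision currently published on each channel, including stable.
snap info microk8s

# Show pending updates for installed snaps.
snap refresh --list

# Show recent snap changes on this machine, e.g. auto-refreshes of microk8s.
snap changes
```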
Summary
Worker nodes sometimes lose access to the API server and become NotReady, likely during and after restarts of control plane nodes.
The situation is that traefik/provider.yaml is present but empty. Restoring traefik/provider.yaml by copying it from another worker node and doing snap restart microk8s is sufficient to recover the worker node.
Reproduction Steps
We can't reproduce it reliably, but the problem occurs regularly (it did in 1.25, and persists after upgrading the cluster to 1.27).
It seems to happen when we restart dqlite nodes, or when they are upgraded.
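A hedged sketch of how one might try to trigger it along those lines, combining the dqlite restart command quoted elsewhere in this thread with a watch on the worker-side file; service and file names are taken from other comments and are assumptions for any given release:

```bash
# On a control-plane (dqlite) node: restart the datastore, as quoted in the comments.
sudo systemctl stop snap.microk8s.daemon-k8s-dqlite
sleep 2
sudo systemctl start snap.microk8s.daemon-k8s-dqlite

# On a worker node: watch whether provider.yaml is truncated afterwards.
watch -n 5 'ls -l /var/snap/microk8s/current/args/traefik/provider.yaml'
```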
Here is the worker node log when it starts failing:
Introspection Report
An introspection report from the failing worker node is available if needed.