
Empty traefik/provider.yaml on worker nodes #4197

Closed
sbidoul opened this issue Sep 8, 2023 · 21 comments

sbidoul commented Sep 8, 2023

Summary

Worker nodes sometimes lose access to the API server and become NotReady, likely during and after restarts of the control plane nodes.

The situation is that traefik/provider.yaml is present but empty. Restoring traefik/provider.yaml by copying it from another worker node and running snap restart microk8s is sufficient to recover the worker node.
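
For reference, a minimal sketch of that recovery (the healthy-worker hostname is a placeholder for any worker node that still has an intact file):

# fetch an intact provider.yaml from a healthy worker node (hostname is a placeholder)
scp healthy-worker:/var/snap/microk8s/current/args/traefik/provider.yaml /tmp/provider.yaml
sudo cp /tmp/provider.yaml /var/snap/microk8s/current/args/traefik/provider.yaml
# restart so the apiserver-proxy reloads the file
sudo snap restart microk8s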

Reproduction Steps

We can't reproduce it reliably, but the problem occurs regularly (it did in 1.25 and persists after upgrading the cluster to 1.27).
It seems to happen when we restart the dqlite nodes, or when they are upgraded.

Here is the worker node log when it starts failing:

Sep 08 05:00:01 odoo-k8s-test-worker-3 microk8s.daemon-kubelite[2506]: E0908 05:00:01.609453    2506 controller.go:193] "Failed to update lease" err="Put \"https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/odoo-k8s-test-worker-3?timeout=10s\":>
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 updating endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 Config file changed on disk, will restart proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Usage:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:    apiserver-proxy [flags]
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Flags:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:   -h, --help                        help for apiserver-proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --kubeconfig string           path to kubeconfig file to use for updating list of known control plane nodes (default "/var/snap/microk8s/5891/credentials/kubelet.config")
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --refresh-interval duration   refresh interval (default 30s)
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --traefik-config string       path to apiserver proxy config file (default "/var/snap/microk8s/5891/args/traefik/traefik.yaml")

Introspection Report

An introspection report from the failing worker node is available if needed.

i6-xx commented Sep 28, 2023

I have exactly the same issue.
Fresh installation (v1.28.1) with 18 nodes.
3 nodes had an empty provider.yaml; copying it from another node and restarting helped.

@costigator

Same for me with a fresh cluster (v1.28.1) with 3 master and 2 worker nodes. One of the workers had an empty /var/snap/microk8s/current/args/traefik/provider.yaml file after joining the cluster. After copying its content from the other worker node and restarting the service with snap restart microk8s, it started working.

Mbd06b commented Oct 19, 2023

I only have one worker node. ...traefik/provider.yaml is empty. I'm not entirely sure what this file is supposed to look like.

Ah, I see it, you provided one.
It's located in:

/var/snap/microk8s/current/args/traefik/provider-template.yaml

tcp:
  routers:
    Router-1:
      rule: "HostSNI(`*`)"
      service: "kube-apiserver"
      tls:
        passthrough: true
  services:
    kube-apiserver:
      loadBalancer:
        servers:
# APISERVERS
#      - address: "10.130.0.2:16443"
#      - address: "10.130.0.3:16443"
#      - address: "10.130.0.4:16443"

Workaround: just copy this over into the provider.yaml file in the same directory.
Then uncomment the servers, updating the addresses with the local IPs of the control-plane nodes on your network.

Then run

sudo snap stop microk8s
sudo snap start microk8s 

which should apply the config.
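
Putting the workaround together, a minimal sketch (the 192.0.2.x addresses are placeholders for your own control-plane IPs):

cd /var/snap/microk8s/current/args/traefik
sudo cp provider-template.yaml provider.yaml
# edit provider.yaml: uncomment the server entries and set your control-plane IPs, e.g.
#      - address: "192.0.2.10:16443"
#      - address: "192.0.2.11:16443"
sudo snap restart microk8s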

Worked for me :P

Per Documentation: https://microk8s.io/docs/configuring-services

snap.microk8s.daemon-traefik and snap.microk8s.daemon-apiserver-proxy
The traefik and apiserver-proxy daemons are used on worker nodes as a proxy to all API server control plane endpoints. The traefik daemon was replaced by the apiserver-proxy in the 1.25+ releases.

The most significant configuration option for both daemons is the list of API server endpoints found in ${SNAP_DATA}/args/traefik/provider.yaml. For the apiserver-proxy daemon (1.25+ onwards), the refresh frequency of the available control plane endpoints can be set in ${SNAP_DATA}/args/apiserver-proxy via the --refresh-interval parameter.
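
For example, assuming the usual MicroK8s convention of one flag per line in an args file, the refresh interval could be adjusted on a worker node like this (the 10s value is only an illustration):

# append the flag to the apiserver-proxy args file, then restart to pick it up
echo '--refresh-interval=10s' | sudo tee -a /var/snap/microk8s/current/args/apiserver-proxy
sudo snap restart microk8s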

pampie commented Oct 22, 2023

Similar issue: the workers were suddenly reported as NotReady when I restarted them, v1.28.1. It turns out provider.yaml was empty, but a bit too late for me, as I had already re-joined one of the worker nodes with a fresh microk8s snap package.

This seems like a serious issue for production environments.

@adrienpeiffer

Same situation during an upgrade from 1.27.6 to 1.28.2 on a worker node, after the upgrade of the dqlite nodes.

@xinstein

Same here with a fresh installation of microk8s 1.28.3/stable.

It reappears randomly and only occurs on some worker nodes, while other worker nodes work fine.

I'll keep observing.

DileepAP commented Feb 5, 2024

I have the same situation, and updating the "provider.yaml" with the API server endpoints did not fix the issue.
The Kubernetes versions I used were 1.28.3 and 1.29.0.

I have a six node HA cluster.
microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.40.101.83:19001 10.40.101.185:19001 10.40.101.186:19001
datastore standby nodes: 10.40.101.85:19001 10.40.101.128:19001 10.40.101.129:19001

When any of the datastore nodes is shut down, the other nodes move to NotReady status. This happens occasionally.

xaa@ha-02:~$ date
Fri 2 Feb 12:28:23 UTC 2024
xaa@ha-02:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ha-05 Ready 66m v1.29.0 10.40.101.128 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-01 Ready 83m v1.29.0 10.40.101.83 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-06 Ready 61m v1.29.0 10.40.101.129 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-04 Ready 71m v1.29.0 10.40.101.186 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-03 Ready 75m v1.29.0 10.40.101.185 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-02 Ready 78m v1.29.0 10.40.101.85 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
xaa@ha-02:~$

Every 2.0s: kubectl get nodes ha-03: Fri Feb 2 12:30:04 2024

NAME STATUS ROLES AGE VERSION
ha-03 Ready 76m v1.29.0
ha-02 Ready 79m v1.29.0
ha-06 NotReady 62m v1.29.0
ha-01 NotReady 85m v1.29.0
ha-04 NotReady 73m v1.29.0
ha-05 NotReady 68m v1.29.0

It takes 15 to 20 minutes to recover automatically, and the applications are not accessible during this window.

Restarting the dqlite service on the nodes that show "NotReady" can bring them back to the "Ready" state quickly:
systemctl stop snap.microk8s.daemon-k8s-dqlite; sleep 2; systemctl start snap.microk8s.daemon-k8s-dqlite

Is there any fix for this issue?

Mbd06b commented Apr 23, 2024

I just want to bump this and keep this issue alive. I've had to manually reapply that provider.yaml on one worker node over a dozen times now over the last 7-8 months.

If anyone has any idea of the root cause, such as whether something else is triggering this, that would be great to know.

sbidoul commented Apr 23, 2024

FWIW, this seems to happen occasionally on worker nodes when dqlite nodes reboot.

I found this code, which manipulates provider.yaml, but it seems to be used on join only?

Mbd06b commented Apr 23, 2024

About the configuration here: I KNOW it needs to be present on the worker node in order for the node to connect and show "Ready".
[screenshot]

Does this provider.yaml need to be applied in all the master-node args for it to persist to the worker node during resets? Or would that disrupt the functioning of the control-plane nodes? Should I try it? Thoughts?

Mbd06b commented May 8, 2024

OK, I think I fixed this. [Update 5-9-24] Ugh, no, I didn't fix it; it's still as broken as ever. As a workaround, I've built a cron job that checks the file every 5 minutes and, if provider.yaml has disappeared, stuffs a fresh copy in there and restarts microk8s on the worker node.
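
A rough sketch of that check (the backup path and schedule are placeholders, not the exact script):

# /etc/cron.d/check-provider -- run the check every 5 minutes as root
*/5 * * * * root /usr/local/bin/check-provider.sh

# /usr/local/bin/check-provider.sh
#!/bin/bash
PROVIDER=/var/snap/microk8s/current/args/traefik/provider.yaml
BACKUP=/root/provider.yaml.good   # known-good copy kept elsewhere (placeholder path)
# -s is false when the file is missing or empty
if [ ! -s "$PROVIDER" ]; then
  cp "$BACKUP" "$PROVIDER"
  snap restart microk8s
fi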

The issue was that I had inconsistent traefik provider.yaml args on my control nodes (i.e. the nodes on the control plane):

/var/snap/microk8s/current/args/traefik/provider.yaml

So I went through each of my control nodes (4) and applied the SAME provider.yaml on each control node, and then restarted microk8s from the master node.

Then did a "sudo snap microk8s stop && sudo snap microk8s start" on the affected worker node.
The worker node got the provider.yaml as I defined them on the control nodes.

I'm hoping that this is going to be persistent. If I'm back here in two weeks I'll let you know if it worked.

Ochita commented Aug 14, 2024

Hi, still got the same problem on v1.29.7.
14 servers, with 6 of them being control planes.

Ochita commented Aug 21, 2024

Little update: we had a network outage, and our nodes went through a roughly 15-20 minute NotReady/Ready cycle every time one of the control planes dropped. Changing /var/snap/microk8s/current/args/ha-conf to a single failure-domain for all control planes seems to fix the issue (originally there were two zones, as we have two data centres nearby). Now only the node that dropped due to the network outage is NotReady, and the workers don't change status at all. If I bring back the second failure-domain, the problems start again.

Not completely sure what's going on, but this issue may be related to failure-domains.

lahcim commented Sep 8, 2024

Confirming the same issue here. It looks like traefik/provider.yaml is losing its contents.
Possibly this happens when a reboot of the node coincides with the refresh interval updating the file, so the file is not saved correctly?
Can we make this refresh logic more resilient, to ensure an accidental reboot during a refresh does not leave an empty file behind?
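
For what it's worth, the usual way to make such an update safe is to write to a temporary file and atomically rename it into place; a rough shell illustration of the idea (generate_endpoints is a placeholder for whatever produces the new config, this is not the actual proxy code):

target=/var/snap/microk8s/current/args/traefik/provider.yaml
tmp=$(mktemp "${target}.XXXXXX")    # temp file on the same filesystem as the target
generate_endpoints > "$tmp"         # placeholder: write the new endpoint list
mv "$tmp" "$target"                 # atomic replace; an interruption never leaves an empty file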

sbidoul commented Sep 18, 2024

I have just had a case of provider.yaml becoming empty on a worker node.

The logs at the time were absolutely quiet; all there was is this:

microk8s.daemon-apiserver-proxy[50117]: 2024/09/18 12:22:00 updating endpoints
microk8s.daemon-apiserver-proxy[50117]: 2024/09/18 12:22:00 Config file changed on disk, will restart proxy
microk8s.daemon-apiserver-proxy[50117]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints

I've tried looking at the code but so far could not find the place responsible for this "updating endpoints" message. My hypothesis is that there is some incorrect error-handling code there that leaves the file empty.

abdavid commented Oct 16, 2024

Woke up this morning to an empty provider.yaml on one of my worker nodes.

HA Cluster, 3 control-planes and 7 workers.

Same logs as observed before

microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 updating endpoints
microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 Config file changed on disk, will restart proxy
microk8s.daemon-apiserver-proxy[252757]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints
microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 proxy failed: accept tcp [::]:16443: use of closed network connection

I do, however, observe this log line on one of my control planes:

microk8s.daemon-kubelite[820226]: W1016 00:10:31.058031  820226 lease.go:265] Resetting endpoints for master service "kubernetes" to [192.168.x.x 192.168.x.x]

@HomayoonAlimohammadi (Contributor)

Thank you all for your time and effort in bringing attention to this issue and for sharing your workarounds. I sincerely apologize for the inconvenience and the fact that this issue is still not resolved. We are currently working on identifying a solution, and we will provide updates as soon as we have more information.

@HomayoonAlimohammadi (Contributor)

I suspect this line is responsible for the issue.
Unfortunately I have not been able to reproduce this scenario so far. I even tried with a 13-node cluster (3 control planes, 10 workers, MicroK8s v1.28) running on LXD containers and did a bunch of chaotic things to the cluster (restarting containers, removing, rejoining, etc.), yet the problem still didn't surface.
We'll start working towards a fix and post further updates here.

HomayoonAlimohammadi added a commit to canonical/microk8s-cluster-agent that referenced this issue Oct 25, 2024
@HomayoonAlimohammadi (Contributor)

Hopefully with the proposed fix we won't see this issue anymore. The change will soon be promoted to stable.
Let us know if you are still experiencing this issue.

sbidoul commented Oct 25, 2024

Thank you @HomayoonAlimohammadi !

By the way, how can one identify the updates that are released on the stable channel?

bschimke95 pushed a commit to canonical/microk8s-cluster-agent that referenced this issue Oct 25, 2024
@HomayoonAlimohammadi (Contributor)

Hi @sbidoul! Sorry that I completely missed your question.
Unfortunately it's a bit hard to tell, but any snap of 1.28 up to 1.31 should contain the fix.
