
Empty traefik/provider.yaml on worker nodes #4197

Closed
sbidoul opened this issue Sep 8, 2023 · 21 comments

sbidoul commented Sep 8, 2023

Summary

Worker nodes sometimes lose access to the API server and become NotReady, likely during and after restarts of the control plane nodes.

The situation is that traefik/provider.yaml is present but empty. Restoring traefik/provider.yaml by copying it from another worker node and running snap restart microk8s is sufficient to recover the worker node.
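
For reference, a minimal sketch of that recovery (the healthy-worker hostname is a placeholder for any worker node that still has an intact file):

# fetch an intact provider.yaml from a healthy worker node (hostname is a placeholder)
scp healthy-worker:/var/snap/microk8s/current/args/traefik/provider.yaml /tmp/provider.yaml
sudo cp /tmp/provider.yaml /var/snap/microk8s/current/args/traefik/provider.yaml
# restart so the apiserver-proxy reloads the file
sudo snap restart microk8s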

Reproduction Steps

We can't reproduce it reliably, but the problem occurs regularly (it did in 1.25 and persists after upgrading the cluster to 1.27).
It seems to happen when we restart the dqlite nodes, or when they are upgraded.

Here is the worker node log when it starts failing:

Sep 08 05:00:01 odoo-k8s-test-worker-3 microk8s.daemon-kubelite[2506]: E0908 05:00:01.609453    2506 controller.go:193] "Failed to update lease" err="Put \"https://127.0.0.1:16443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/odoo-k8s-test-worker-3?timeout=10s\":>
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 updating endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: 2023/09/08 05:00:24 Config file changed on disk, will restart proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Usage:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:    apiserver-proxy [flags]
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]: Flags:
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:   -h, --help                        help for apiserver-proxy
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --kubeconfig string           path to kubeconfig file to use for updating list of known control plane nodes (default "/var/snap/microk8s/5891/credentials/kubelet.config")
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --refresh-interval duration   refresh interval (default 30s)
Sep 08 05:00:24 odoo-k8s-test-worker-3 microk8s.daemon-apiserver-proxy[2922]:       --traefik-config string       path to apiserver proxy config file (default "/var/snap/microk8s/5891/args/traefik/traefik.yaml")

Introspection Report

An introspection report from the failing worker node is available if needed.

i6-xx commented Sep 28, 2023

I have exactly the same issue.
Fresh installation (v1.28.1) with 18 nodes.
3 nodes had an empty provider.yaml; copying it from another node and restarting helped.

@costigator

Same for me with a fresh cluster (v1.28.1) with 3 master and 2 worker nodes. One of the workers had an empty /var/snap/microk8s/current/args/traefik/provider.yaml file after joining the cluster. After copying its content from the other worker node and restarting the service with snap restart microk8s, it started working.

Mbd06b commented Oct 19, 2023

I only have one worker node. ...traefik/provider.yaml is empty. I'm not entirely sure what this file is supposed to look like.

Ah, I see it, you provided one.
It's located in:

/var/snap/microk8s/current/args/traefik/provider-template.yaml

tcp:
  routers:
    Router-1:
      rule: "HostSNI(`*`)"
      service: "kube-apiserver"
      tls:
        passthrough: true
  services:
    kube-apiserver:
      loadBalancer:
        servers:
# APISERVERS
#      - address: "10.130.0.2:16443"
#      - address: "10.130.0.3:16443"
#      - address: "10.130.0.4:16443"

Workaround: just copy this over into the provider.yaml file in the same directory.
Then uncomment the servers, updating the addresses with the local IPs of the control-plane nodes on your network.

Then run

sudo snap stop microk8s
sudo snap start microk8s 

which should apply the config.
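
Putting the workaround together, a minimal sketch (the 192.0.2.x addresses are placeholders for your own control-plane IPs):

cd /var/snap/microk8s/current/args/traefik
sudo cp provider-template.yaml provider.yaml
# edit provider.yaml: uncomment the server entries and set your control-plane IPs, e.g.
#      - address: "192.0.2.10:16443"
#      - address: "192.0.2.11:16443"
sudo snap restart microk8s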

Worked for me :P

Per Documentation: https://microk8s.io/docs/configuring-services

snap.microk8s.daemon-traefik and snap.microk8s.daemon-apiserver-proxy
The traefik and apiserver-proxy daemons are used on worker nodes as a proxy to all API server control plane endpoints. The traefik daemon was replaced by the apiserver-proxy in the 1.25+ releases.

The most significant configuration option for both daemons is the list of API server endpoints found in ${SNAP_DATA}/args/traefik/provider.yaml. For the apiserver-proxy daemon (1.25+ onwards), the refresh frequency of the available control plane endpoints can be set in ${SNAP_DATA}/args/apiserver-proxy via the --refresh-interval parameter.
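
For example, assuming the usual MicroK8s convention of one flag per line in an args file, the refresh interval could be adjusted on a worker node like this (the 10s value is only an illustration):

# append the flag to the apiserver-proxy args file, then restart to pick it up
echo '--refresh-interval=10s' | sudo tee -a /var/snap/microk8s/current/args/apiserver-proxy
sudo snap restart microk8s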

pampie commented Oct 22, 2023

Similar issue: the workers were suddenly reported as NotReady when I restarted them, v1.28.1. It turns out provider.yaml was empty, but a bit too late for me, as I had already re-joined one of the worker nodes with a fresh microk8s snap package.

This seems like a serious issue for production environments.

@adrienpeiffer

Same situation during an upgrade from 1.27.6 to 1.28.2 on a worker node, after the upgrade of the dqlite nodes.

@xinstein

Same here with a fresh installation of microk8s 1.28.3/stable.

It reappears randomly and only occurs on some worker nodes, while other worker nodes work fine.

I'll keep observing.

DileepAP commented Feb 5, 2024

I have the same situation, and updating the "provider.yaml" with the API server endpoints did not fix the issue.
The Kubernetes versions I used were 1.28.3 and 1.29.0.

I have a six node HA cluster.
microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 10.40.101.83:19001 10.40.101.185:19001 10.40.101.186:19001
datastore standby nodes: 10.40.101.85:19001 10.40.101.128:19001 10.40.101.129:19001

When any of the datastore nodes is shut down, the other nodes move to NotReady status. This happens occasionally.

xaa@ha-02:~$ date
Fri 2 Feb 12:28:23 UTC 2024
xaa@ha-02:~$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ha-05 Ready 66m v1.29.0 10.40.101.128 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-01 Ready 83m v1.29.0 10.40.101.83 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-06 Ready 61m v1.29.0 10.40.101.129 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-04 Ready 71m v1.29.0 10.40.101.186 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-03 Ready 75m v1.29.0 10.40.101.185 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
ha-02 Ready 78m v1.29.0 10.40.101.85 Red Hat Enterprise Linux 8.9 (Ootpa) 4.18.0-513.11.1.el8_9.x86_64 containerd://1.6.15
xaa@ha-02:~$

Every 2.0s: kubectl get nodes ha-03: Fri Feb 2 12:30:04 2024

NAME STATUS ROLES AGE VERSION
ha-03 Ready 76m v1.29.0
ha-02 Ready 79m v1.29.0
ha-06 NotReady 62m v1.29.0
ha-01 NotReady 85m v1.29.0
ha-04 NotReady 73m v1.29.0
ha-05 NotReady 68m v1.29.0

It takes 15 to 20 minutes to recover automatically, and the applications are not accessible during this window.

Restarting the dqlite service on the nodes that show "NotReady" can bring them back to the "Ready" state quickly:
systemctl stop snap.microk8s.daemon-k8s-dqlite; sleep 2; systemctl start snap.microk8s.daemon-k8s-dqlite

Is there any fix for this issue?

Mbd06b commented Apr 23, 2024

I just want to bump this and keep this issue alive. I've had to manually reapply that provider.yaml on one worker node over a dozen times now over the last 7-8 months.

If anyone has any idea of the root cause, such as whether something else is triggering this, that would be great to know.

sbidoul commented Apr 23, 2024

FWIW, this seems to happen occasionally on worker nodes when dqlite nodes reboot.

I found this code, which manipulates provider.yaml, but it seems to be used on join only?

Mbd06b commented Apr 23, 2024

About the configuration here: I KNOW it needs to be present on the worker node in order for the node to connect and show "Ready".
[screenshot]

Does this provider.yaml need to be applied in all the master-node args for it to persist to the worker node during resets? Or would that disrupt the functioning of the control-plane nodes? Should I try it? Thoughts?

Mbd06b commented May 8, 2024

OK, I think I fixed this. [Update 5-9-24] Ugh, no, I didn't fix it; it's still as broken as ever. As a workaround, I've built a cron job that checks the file every 5 minutes and, if provider.yaml has disappeared, stuffs a fresh copy in there and restarts microk8s on the worker node.
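
A rough sketch of that check (the backup path and schedule are placeholders, not the exact script):

# /etc/cron.d/check-provider -- run the check every 5 minutes as root
*/5 * * * * root /usr/local/bin/check-provider.sh

# /usr/local/bin/check-provider.sh
#!/bin/bash
PROVIDER=/var/snap/microk8s/current/args/traefik/provider.yaml
BACKUP=/root/provider.yaml.good   # known-good copy kept elsewhere (placeholder path)
# -s is false when the file is missing or empty
if [ ! -s "$PROVIDER" ]; then
  cp "$BACKUP" "$PROVIDER"
  snap restart microk8s
fi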

The issue was that I had inconsistent traefik provider.yaml args on my control nodes (i.e. the nodes on the control plane):

/var/snap/microk8s/current/args/traefik/provider.yaml

So I went through each of my control nodes (4) and applied the SAME provider.yaml on each control node, and then restarted microk8s from the master node.

Then did a "sudo snap microk8s stop && sudo snap microk8s start" on the affected worker node.
The worker node got the provider.yaml as I defined them on the control nodes.

I'm hoping that this is going to be persistent. If I'm back here in two weeks I'll let you know if it worked.

Ochita commented Aug 14, 2024

Hi, still got the same problem on v1.29.7.
14 servers, with 6 of them being control planes.

Ochita commented Aug 21, 2024

Little update: we had a network outage, and our nodes went through a roughly 15-20 minute NotReady/Ready cycle every time one of the control planes dropped. Changing /var/snap/microk8s/current/args/ha-conf to a single failure-domain for all control planes seems to fix the issue (originally there were two zones, as we have two data centres nearby). Now only the node that dropped due to the network outage is NotReady, and the workers don't change status at all. If I bring back the second failure-domain, the problems start again.

Not completely sure what's going on, but this issue may be related to failure-domains.

lahcim commented Sep 8, 2024

Confirming the same issue here. It looks like traefik/provider.yaml is losing its contents.
Possibly this happens when a reboot of the node coincides with the refresh interval updating the file, so the file is not saved correctly?
Can we make this refresh logic more resilient, to ensure an accidental reboot during a refresh does not leave an empty file behind?
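
For what it's worth, the usual way to make such an update safe is to write to a temporary file and atomically rename it into place; a rough shell illustration of the idea (generate_endpoints is a placeholder for whatever produces the new config, this is not the actual proxy code):

target=/var/snap/microk8s/current/args/traefik/provider.yaml
tmp=$(mktemp "${target}.XXXXXX")    # temp file on the same filesystem as the target
generate_endpoints > "$tmp"         # placeholder: write the new endpoint list
mv "$tmp" "$target"                 # atomic replace; an interruption never leaves an empty file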

sbidoul commented Sep 18, 2024

I have just had a case of provider.yaml becoming empty on a worker node.

The logs at the time were absolutely quiet; all there was is this:

microk8s.daemon-apiserver-proxy[50117]: 2024/09/18 12:22:00 updating endpoints
microk8s.daemon-apiserver-proxy[50117]: 2024/09/18 12:22:00 Config file changed on disk, will restart proxy
microk8s.daemon-apiserver-proxy[50117]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints

I've tried looking at the code but so far could not find the place responsible for this "updating endpoints" message. My hypothesis is that there is some incorrect error-handling code there that leaves the file empty.

abdavid commented Oct 16, 2024

Woke up this morning to an empty provider.yaml on one of my worker nodes.

HA Cluster, 3 control-planes and 7 workers.

Same logs as observed before

microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 updating endpoints
microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 Config file changed on disk, will restart proxy
microk8s.daemon-apiserver-proxy[252757]: Error: proxy failed: failed to load configuration: empty list of control plane endpoints
microk8s.daemon-apiserver-proxy[252757]: 2024/10/16 00:10:32 proxy failed: accept tcp [::]:16443: use of closed network connection

I do, however, observe this log line on one of my control planes:

microk8s.daemon-kubelite[820226]: W1016 00:10:31.058031  820226 lease.go:265] Resetting endpoints for master service "kubernetes" to [192.168.x.x 192.168.x.x]

@HomayoonAlimohammadi (Contributor)

Thank you all for your time and effort in bringing attention to this issue and for sharing your workarounds. I sincerely apologize for the inconvenience and the fact that this issue is still not resolved. We are currently working on identifying a solution, and we will provide updates as soon as we have more information.

@HomayoonAlimohammadi (Contributor)

I suspect this line is responsible for the issue.
Unfortunately I have not been able to reproduce this scenario so far. I even tried with a 13-node cluster (3 control planes, 10 workers, MicroK8s v1.28) running on LXD containers and did a bunch of chaotic things to the cluster (restarting containers, removing, rejoining, etc.), yet the problem still didn't surface.
We'll start working towards a fix and post further updates here.

HomayoonAlimohammadi added a commit to canonical/microk8s-cluster-agent that referenced this issue Oct 25, 2024
@HomayoonAlimohammadi (Contributor)

Hopefully with the proposed fix we won't see this issue anymore. The change will soon be promoted to stable.
Let us know if you are still experiencing this issue.

sbidoul commented Oct 25, 2024

Thank you @HomayoonAlimohammadi !

By the way, how can one identify the updates that are released on the stable channel?

bschimke95 pushed a commit to canonical/microk8s-cluster-agent that referenced this issue Oct 25, 2024
@HomayoonAlimohammadi (Contributor)

Hi @sbidoul! Sorry that I completely missed your question.
Unfortunately it's a bit hard to tell, but any snap of 1.28 up to 1.31 should contain the fix.
