Master node unresponsive after reboot #3204
Out of urgency, and as a last-ditch attempt, I was actually able to work around the problem and restore the cluster, but I don't understand what happened... Here are my steps:
Attached the tarball post-restore to see if it provides more clues to what happened.
Identical behaviour seen here on Ubuntu 22.04 LTS with MicroK8s v1.26.0 after an unclean shutdown. Mine was a 4-node cluster in HA mode. inspection-report-20230120_195306.tar.gz
I'm going to attempt the workaround in #3204 (comment) to restore the cluster.
@djjudas21 did you manage to get it working?
@Marahin Unfortunately I was never able to repair my cluster, so in the end I destroyed it, recreated it from scratch, and restored my PVCs from backup. It worked fine for 3 weeks, but yesterday it broke again (#3735) and I'm trying to figure out how to fix it without having to restore from scratch again. I'm pretty sure it's related to dqlite quorum, but it's pretty bad that it's happened twice.
This has happened to me about 10 times over the past couple of years. I have a battery-backed server rack that covers only the control nodes, and I allow my worker nodes to power off in the case of a loss of electricity. It's common for an outage to last longer than the UPS can hold out. For some reason, one or two of my control nodes will regularly fail. HA shifts to a standby worker node, and I have time to fully wipe and reinstall one of my control nodes, which then takes back control. Yes, this sucks. A lot. I don't have any idea why it breaks, but there's a 30-40% chance any time you hard-power-off a box that it won't come back up again.
@djjudas21 @jhughes2112 I wasn't able to find the culprit or restore the cluster either. It actually froze when trying to add the node that was reformatted.
I managed to rescue a cluster that lost quorum tonight, which is why I was on this thread. If you lose HA status and need to recover from kubectl and microk8s hanging, it is possible to regain quorum by hand. K3s is being used on another project at my company; that's the only other contender I'd consider, tbh.
Thanks @jhughes2112, I think I'd be more confident attempting to regain quorum next time. Unfortunately in #3735 I lost quorum in a pretty bad way and then made it worse because I didn't know how to tackle it (I tried to …).

My cluster has broken like this twice now, and I have never been able to figure out a root cause. The nodes did not lose power or network. In my situation, I actually lost data because I was using OpenEBS cStor as hyperconverged storage, i.e. replicas of volumes are stored on local block devices across your nodes. It turns out the way cStor handles its own replica placement and failover is by relying on the kube API, so when MicroK8s loses quorum, so does your storage engine 🙈

I've had to start my cluster from scratch, but I haven't found a way of adopting my existing cStor replicas into a new cluster, and OpenEBS "support" has been zero help. So I bought a NAS and have restored most of my data from backup. At least I can re-adopt PVs from a NAS into a vanilla cluster (see the sketch below). Lesson learned.

At work I use Red Hat OpenShift as the Kubernetes of choice, which is obviously way overkill for a home setup. Today they asked me to look into Rancher and it actually looks pretty good. It uses k3s underneath but adds some nice features. I will definitely consider it at home next time I have to smash everything up and start again.
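For context on re-adopting PVs from a NAS, a statically defined NFS PersistentVolume is one way to do it. This is a minimal sketch; the server address, export path, and size are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-data
spec:
  capacity:
    storage: 100Gi                          # hypothetical size
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain     # keep the data even if the claim is deleted
  nfs:
    server: 192.168.1.50                    # hypothetical NAS address
    path: /mnt/tank/restored-data           # hypothetical export path
```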
@djjudas21 I had a similar problem when I first started out, using an older version of Rancher and Longhorn. Longhorn uses local disks, and I blew away nodes (not knowing what I was doing). Very frustrating. I threw FreeNAS on a separate box with a bunch of disks and use a package called Democratic CSI as my StorageClass provisioner, which lets me serve NFS volumes (see the sketch below). Very handy when I need to mount them from a Windows or Linux box from anywhere, since NFS allows file sharing. Not production-worthy performance, but very convenient. I may need to look into Rancher again, now that k3s is out. Seems like a lot of us stub our toes the same way. ;-)
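A democratic-csi StorageClass looks roughly like this. Both the provisioner name (it must match whatever `csiDriver.name` was set to in the democratic-csi Helm values) and the parameters here are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: freenas-nfs
provisioner: org.democratic-csi.nfs   # assumption: must match csiDriver.name from the Helm values
reclaimPolicy: Retain
allowVolumeExpansion: true
parameters:
  fsType: nfs
```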
@jhughes2112 yes, I'm using TrueNAS with Democratic CSI too. The guy who maintains Democratic CSI has been super helpful when I've had questions etc. I just wish I'd learnt these hard lessons at work with customer data, rather than at home with my own data 😂
I am also experiencing this issue |
I had to deal with this yesterday. I lost power at my house and my networking gear restarted. When that happened, my cluster lost quorum. At the time I didn't know that, since nothing was indicating it. My error was the same as everyone else's.
When I began to debug the issue, I did see the documentation from MicroK8s about restoring quorum (https://microk8s.io/docs/restore-quorum). I didn't think it applied to me since I only had two master nodes, so I ignored it and just performed the backup part; a sketch of that step, and of what my cluster.yaml looked like on my main master (I call it my main master since I started my cluster with it), follows:
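The backup step amounts to stopping MicroK8s and archiving the dqlite state directory. A minimal sketch, assuming the standard snap paths from the restore-quorum doc (the archive name is just an example):

```bash
# Stop MicroK8s so the dqlite state is not being written while it is copied
microk8s stop

# Archive the dqlite backend directory; this becomes the backup.tar used later
sudo tar -cvf backup.tar -C /var/snap/microk8s/current/var/kubernetes backend
```

And a sketch of what a cluster.yaml with these three nodes plausibly looked like. The IDs are placeholders, port 19001 is the dqlite port, and the Role values (0 = voter, 1 = stand-by) are assumptions about this particular cluster:

```yaml
# /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml (illustrative)
- Address: X.X.X.20:19001     # main master
  ID: 1111111111111111111     # placeholder dqlite node ID
  Role: 0                     # voter
- Address: X.X.X.239:19001    # other master
  ID: 2222222222222222222     # placeholder
  Role: 0                     # voter
- Address: X.X.X.10:19001     # worker-only node
  ID: 3333333333333333333     # placeholder
  Role: 1                     # stand-by (assumption)
```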
X.X.X.239 is my other master and X.X.X.10 is one of my worker-only nodes. I was surprised to see that it only showed one worker node, so I made a mental note of that; I think it was initially a master and I converted it to a worker. After several restart attempts failed, I decided to spin up a new VM to host another cluster, hoping that I could use the backup I created to somehow recover the X.X.X.20 master node. When I tried to restore my backup.tar onto the new working cluster, I was able to reproduce the issue. This excited me, because I knew the backup worked. So I returned to my main master node and followed the instructions from the MicroK8s website on recovering quorum. I modified the cluster.yaml file to look like this:
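Per the restore-quorum doc, the edit boils down to leaving only the healthy node in the file, as a voter. A sketch with the same placeholder ID as above (the real ID has to match the one recorded for that node):

```yaml
# cluster.yaml after the edit (illustrative): only the main master remains
- Address: X.X.X.20:19001
  ID: 1111111111111111111     # placeholder; must match this node's actual dqlite ID
  Role: 0                     # voter
```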
Then I ran the reconfigure command (restart steps sketched below). I did this because I didn't care about the other nodes on that list: all of my data was safe in a Ceph cluster hosted by my broken cluster, and none of the nodes on the list were storage nodes. At that time, all my worker nodes were off except for one (it was bare metal and I didn't feel like shutting it down, mostly because I forgot), so I started MicroK8s on the master node that I had performed the fix on, and I've never been happier in my life: Kubernetes started and everything came back. My lesson learned here is that I really need to buy a battery backup for my networking gear.
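The restart-and-verify steps after the reconfigure presumably looked something like this (the reconfigure invocation itself is spelled out in the restore-quorum doc and omitted here):

```bash
# Bring the repaired control-plane node back up
microk8s start

# Block until the node reports ready
microk8s status --wait-ready

# Confirm the API server answers again and the expected nodes are present
microk8s kubectl get nodes
```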
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am also experiencing this issue. I am running MicroK8s 1.26 on my Ubuntu 22.04 desktop with ZFS. One day, my machine shut down suddenly while I was removing a 7 TiB file. Then I got the error:
Hey all. I abandoned MicroK8s after two years and tried k3s, and had a much worse experience (there's a continuous increase in CPU that eventually chokes your control nodes). What both of these packages have in common is Kine for the data store. I was heavily involved in trying to help diagnose a CPU-spike problem on MicroK8s that had to do with Kine and the way it handled a node using too much async IO: it basically fails and makes the control plane unresponsive. Kine is unfixable. Unfixable. Drop it like it's hot. Unfixable. I switched to k0s about six months ago (a minimal bootstrap is sketched below) and it's been fine. Power fluctuations, random disconnections, etc., and I have not had to rebuild my cluster. (I'm not doing HA, so maybe I'm dodging a bullet there.) If you decide to stay on MicroK8s (which I loved), find a way to stop using Kine as the data store on the control plane. Less easy, but it will work. Etcd is battle-tested. gl;hf
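For anyone tempted by the same move, a single-node k0s bootstrap is short. A sketch based on the k0s quick start; verify the flags against the current k0s docs:

```bash
# Download and install the k0s binary
curl -sSLf https://get.k0s.sh | sudo sh

# Install k0s as a single-node controller that also runs workloads, then start it
sudo k0s install controller --single
sudo k0s start

# Verify the controller is up and the node is Ready
sudo k0s status
sudo k0s kubectl get nodes
```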
Thanks for your reply. In the end, I rebuilt my cluster. I think I should switch to k0s too.
Summary
I have a MicroK8s cluster deployment on x86 (Ubuntu 20.04, snap MicroK8s v1.24.0). I configured it with two worker nodes and everything ran correctly for weeks. However, as soon as I restarted the master node one day, the server stopped responding: kubectl is completely unresponsive, and `microk8s start` hangs at `Started.`. A look at the `kubelite` service log (which can be tailed as sketched below) reveals suspicious log patterns. Port 12379 points at etcd, but I couldn't find anything wrong with it.
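For anyone debugging the same symptom, the kubelite log lives in the systemd journal; `snap.microk8s.daemon-kubelite` is the unit MicroK8s installs for it:

```bash
# Follow the kubelite service log to watch for the suspicious patterns
journalctl -u snap.microk8s.daemon-kubelite -f

# The inspect report attached below bundles these logs as well
microk8s inspect
```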
What Should Happen Instead?
Kubernetes should start.
Reproduction Steps
I did not try to repro, but here is what I did:
Attached inspect report
Notably, "Inspecting Kubernetes" took almost 5 minutes to complete.
inspection-report-20220605_003610.tar.gz
Can you suggest a fix?
no
Are you interested in contributing with a fix?
no