This one is sort of a rabbit hole, but I think I've figured out what is causing it.
While scaling up the CHEESEhub production cluster from 1 worker to 2, everything seemed to go smoothly at first. During testing, however, I noticed that some pods were failing to start and would remain stuck in ContainerCreating. Describing one of these pods yielded an unfamiliar error message, repeated many times:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m46s default-scheduler Successfully assigned lambert8/szhq4c-sqlinjection-ntprk to cheesehub-worker0
Warning FailedCreatePodSandBox 2m46s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "szhq4c-sqlinjection-ntprk": Error response from daemon: mkdir /var/lib/docker/overlay2/00bbc8fca42634c3bab11d005cb25e6d76a27a57a3cc59d780ea068867a0ce93-init: input/output error
Warning FailedCreatePodSandBox 2m33s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "szhq4c-sqlinjection-ntprk": Error response from daemon: mkdir /var/lib/docker/overlay2/f6af26efbf925d8c41c82e816823ebc36310ec62ae343ae201faaea11937d9e2-init: input/output error
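For reference, this is roughly how the stuck pods and the nodes they landed on can be spotted (the namespace and pod name here are the ones from the events above):
# List pods stuck in ContainerCreating across all namespaces, with their nodes
kubectl get pods --all-namespaces -o wide | grep ContainerCreating
# Inspect the events for one of the affected pods
kubectl describe pod szhq4c-sqlinjection-ntprk -n lambert8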
After much trial and error, I noticed that only pods scheduled on worker0 showed this behavior. Digging into the specifics of that node, I looked at dmesg -T and found odd EXT4 errors attributed to the Docker daemon (comm dockerd):
[Thu Oct 29 18:44:57 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:44:57 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:44:57 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:44:57 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:44:58 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889247: block 11534721: comm dockerd: unable to read itable block
[Thu Oct 29 18:44:58 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:44:58 2020] EXT4-fs warning (device vdb): dx_probe:740: inode #2097153: lblock 0: comm dockerd: error -5 reading directory block
[Thu Oct 29 18:45:00 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889247: block 11534721: comm dockerd: unable to read itable block
[Thu Oct 29 18:45:00 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889247: block 11534721: comm dockerd: unable to read itable block
[Thu Oct 29 18:45:00 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889345: block 11534728: comm dockerd: unable to read itable block
[Thu Oct 29 18:45:00 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889230: block 11534720: comm dockerd: unable to read itable block
[Thu Oct 29 18:45:00 2020] EXT4-fs error (device vdb): __ext4_get_inode_loc:4115: inode #2889211: block 11534719: comm dockerd: unable to read itable block
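(For anyone checking a node for the same symptom: these lines can be pulled straight out of the kernel log; the device name is whatever the Docker volume shows up as.)
# Filter the kernel log for EXT4 errors/warnings on the Docker volume
dmesg -T | grep -E 'EXT4-fs (error|warning) \(device vdb\)'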
Attempting to restart Docker failed:
ubuntu@cheesehub-worker0:~$ sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
ubuntu@cheesehub-worker0:~$ systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: failed (Result: start-limit-hit) since Thu 2020-10-29 18:53:42 UTC; 257ms ago
Docs: https://docs.docker.com
Process: 1709 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
Main PID: 1709 (code=exited, status=1/FAILURE)
Oct 29 18:53:40 cheesehub-worker0 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Oct 29 18:53:40 cheesehub-worker0 systemd[1]: Failed to start Docker Application Container Engine.
Oct 29 18:53:40 cheesehub-worker0 systemd[1]: docker.service: Unit entered failed state.
Oct 29 18:53:40 cheesehub-worker0 systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: Stopped Docker Application Container Engine.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: docker.service: Start request repeated too quickly.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: Failed to start Docker Application Container Engine.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: docker.service: Unit entered failed state.
Oct 29 18:53:42 cheesehub-worker0 systemd[1]: docker.service: Failed with result 'start-limit-hit'.
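As the restart failure message suggests, the actual dockerd error has to be pulled from the journal; something along these lines shows the most recent Docker service logs:
# Show the last 50 journal entries for the Docker unit
journalctl -u docker.service -n 50 --no-pager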
I then tried to look inside /var/lib/docker, only to find that it couldn't even be read:
ubuntu@cheesehub-worker0:~$ sudo su
root@cheesehub-worker0:/home/ubuntu# cd /var/lib/docker
root@cheesehub-worker0:/var/lib/docker# ls -al
ls: reading directory '.': Input/output error
total 0
I checked mount to see what should have been mounted there:
root@cheesehub-worker0:/var/lib/docker# mount | grep /var/lib/docker
/dev/vdb on /var/lib/docker type ext4 (ro,relatime,data=ordered)
Looking at lsblk showed no device at vdb; the device name had somehow changed from vdb to vdc:
root@cheesehub-worker0:/var/lib/docker# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 253:0 0 20G 0 disk
└─vda1 253:1 0 20G 0 part /
vdc 253:32 0 75G 0 disk
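(A quick way to match a renamed virtio device back to its OpenStack volume, if needed, is via the serial exposed by udev, which typically carries the leading characters of the Cinder volume ID; a rough sketch:)
# Show block devices with their serials, or list the udev by-id symlinks
lsblk -o NAME,SIZE,SERIAL,MOUNTPOINT
ls -l /dev/disk/by-id/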
I should also note that, at this point, OpenStack still reported the stale device name in its UI:
Name: cheesehub-worker0-docker
Description: -
Size: 75GiB
Status: In-use
Group: -
Type: default
Attached To: /dev/vdb on cheesehub-worker0
Availability Zone: nova
Bootable: No
Encrypted: No
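(The same attachment info can also be checked from the OpenStack CLI rather than the UI; a rough sketch, using the volume name above:)
# Show the volume's status and attachment details
openstack volume show cheesehub-worker0-docker -c status -c attachments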
I was wondering how this could possibly have happened, and the only logical explanation is that Terraform somehow clobbered this mount while scaling up the cluster.
Checking the Action Log in OpenStack, it looks like the Docker volume was detached and reattached during the scale-up. Doing that to a live node without restarting it is obviously very bad for the cluster resources running there.
I'm fairly certain this comes down to the imperative way we've written the provisioning/bootstrapping side of the Terraform deployment.
In the short term, rebooting the node resolved the issue with the mount.
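(For the record, the safe way to do that reboot on a worker is to drain it first so Kubernetes reschedules its pods, then uncordon it afterwards; roughly:)
kubectl drain cheesehub-worker0 --ignore-daemonsets
# reboot the node (via the OpenStack UI, or sudo reboot on the node itself)
kubectl uncordon cheesehub-worker0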
In the long term, the Terraform deployment needs to be adjusted to account for existing nodes when scaling up. These mounts do not need to be touched, so we need to figure out why Terraform thinks they need to change and prevent it from making that change.
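Until that's fixed, one way to catch this before it bites again is to inspect the saved plan for any changes to volume attachment resources before applying a scale-up; a rough sketch (the variable name here is an assumption, not necessarily what our config uses):
# Save the scale-up plan and check whether any volume attachments would change
terraform plan -var 'worker_count=2' -out=scale-up.tfplan
terraform show scale-up.tfplan | grep -i -B 2 -A 5 'volume_attach'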