Failed to change ownership on socket #318

Open
dippynark opened this issue Jul 24, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@dippynark

dippynark commented Jul 24, 2024

Issue

We have observed the following error when using the GCS FUSE CSI Driver on GKE:

/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Internal desc = failed to mount volume "[REDACTED]" to target path "/var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount": failed to change ownership on socket: chown ./socket: no such file or directory

It appears the socket file could not be found after being created: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L165-L167

Perhaps there is a race condition when changing directory? https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144

Impact

This issue appears to have caused the outcome described in the known issues doc, where FUSE mount operations hang. I suspect this is because socket creation happens after creating the FUSE mount but before passing the file descriptor to the GCS FUSE CSI Driver sidecar.

This interacted with a known kubelet issue where Pod cleanup hangs due to an unresponsive volume mount: kubernetes/kubernetes#101622

This then led to all Pod actions stalling on the node: https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/kubelet/kubelet.go#L148-L151

Confusingly, the node was not marked as unhealthy when this happened. However, this seems to be due to an unrelated GKE node-problem-detector misconfiguration, which I won't give details on here. Unfortunately, since this occurred in a production environment, we needed to manually delete the node to bring the cluster back to a healthy state, so it is no longer around to verify this theory.

This issue has happened twice now on different nodes in the same cluster over the last week.

Note that the kubelet issue seems to have been fixed now, but not in the version of Kubernetes we are using: kubernetes/kubernetes#119968

Environment

GKE version: v1.27.11-gke.1062004
GCS FUSE version: v1.4.1-gke.0

@songjiaxun
Contributor

Hi @dippynark ,

The directory switch operation you mentioned (https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144) is protected by a lock to avoid a race condition.

Could you share more details about your Pod scheduling pattern? Specifically, how many Pods are you scheduling to the same node at the same time? Thank you!

@dippynark
Author

dippynark commented Jul 25, 2024

Hi @songjiaxun, thanks for clearing that up,

There are 3 CronJobs, each creating a Job every minute. Each Job runs one Pod that mounts a GCS bucket, and all Pods mount the same GCS bucket.

Each Pod does a small amount of processing and then exits, so each Job typically takes between 30 and 40 seconds to run. We're using concurrencyPolicy: Forbid on the CronJobs, so we never have more Jobs running than CronJobs, even if they sometimes take longer than a minute to run.

We are also using the optimize utilization GKE autoscaling profile which means the 3 Pods are typically all scheduled to the same node at similar times.

Also, after seeing the socket error, we then started seeing lots of errors like the following (which we weren't seeing before the socket error):

/csi.v1.Node/NodePublishVolume failed with error: rpc error: code = Aborted desc = An operation with the given volume key /var/lib/kubelet/pods/[REDACTED]/volumes/kubernetes.io~csi/[REDACTED]/mount already exists

@songjiaxun
Contributor

Thanks @dippynark for reporting this issue. I am trying to reproduce this on my dev env now.

@songjiaxun
Contributor

Also, @dippynark, as we are moving forward to newer k8s versions, would you consider upgrading your cluster to version 1.29? As you mentioned, the kubelet has a fix for the housekeeping logic, and we will have a better chance of pushing any potential fixes much faster to newer k8s versions.

@dippynark
Author

Hi @songjiaxun, thanks. Yeah, we are working on upgrading the cluster to the latest version in the stable channel, which should hopefully prevent this issue from recurring.

@songjiaxun added the bug label on Aug 19, 2024