Failed to change ownership on socket #318
Comments
Hi @dippynark, the directory switch operation you are referring to, https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144, has a lock to avoid a race condition:
Could you share more details about your Pod scheduling pattern? Specifically, how many Pods are you scheduling to the same node at the same time? Thank you!
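For reference, here is a minimal sketch of the chdir-under-lock pattern described in the comment above. This is not the driver's actual code; the lock variable, function name, and paths are assumptions for illustration. The point it shows is that `os.Chdir` changes process-wide state shared by every goroutine, so concurrent mount operations have to serialize the chdir / bind / chdir-back sequence.

```go
// Hypothetical sketch of a chdir-under-lock socket bind; not the driver's code.
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"sync"
)

var chdirMu sync.Mutex // assumed package-level lock serializing the chdir

// createSocket binds a unix socket using a short relative path (the usual
// reason for the chdir trick is the ~108-byte sun_path limit), holding the
// lock so no other goroutine can move the working directory mid-sequence.
func createSocket(socketPath string) (net.Listener, error) {
	chdirMu.Lock()
	defer chdirMu.Unlock()

	oldWd, err := os.Getwd()
	if err != nil {
		return nil, err
	}
	if err := os.Chdir(filepath.Dir(socketPath)); err != nil {
		return nil, err
	}
	defer os.Chdir(oldWd) // restore the original working directory

	return net.Listen("unix", "./"+filepath.Base(socketPath))
}

func main() {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	l, err := createSocket(filepath.Join(dir, "example.sock"))
	if err != nil {
		fmt.Println("bind failed:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", l.Addr())
}
```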
Hi @songjiaxun, thanks for clearing that up. There are 3 CronJobs, each creating a Job every minute, and each Job runs one Pod that mounts a GCS bucket. All Pods mount the same GCS bucket. Each Pod does a small amount of processing and then exits, so each Job typically takes between 30 and 40 seconds to run. We are also using the optimize-utilization GKE autoscaling profile, which means the 3 Pods are typically all scheduled to the same node at similar times. Also, after seeing the socket error, we then started seeing lots of errors like the following (which we weren't seeing before the socket error):
Thanks @dippynark for reporting this issue. I am trying to reproduce this on my dev env now.
Also, @dippynark, as we are moving forward to newer k8s versions, is it possible that you could consider upgrading your cluster version to 1.29? As you mentioned, the kubelet has a fix for the housekeeping logic, and we will have a better chance to push any potential fixes much faster to newer k8s versions.
Hi @songjiaxun, thanks. Yes, we are working on upgrading the cluster to the latest version in the stable channel, which should hopefully avoid this issue recurring.
Issue
We have observed the following error when using the GCS FUSE CSI Driver on GKE:
It appears the socket file could not be found after being created: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L165-L167
Perhaps there is a race condition when changing directory? https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/blob/v1.4.1/pkg/csi_mounter/csi_mounter.go#L142-L144
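As a hedged illustration of where the reported error message could come from (not the driver's actual code; the function name and path below are made up), the socket is created first and then chowned by its path, so if the socket file is not actually present at that path, `os.Chown` fails with ENOENT and the failure surfaces as a "failed to change ownership on socket" error:

```go
// Hypothetical sketch of the chown-after-create step; not the driver's code.
package main

import (
	"fmt"
	"os"
)

// chownSocket hands the socket file over to the sidecar's uid/gid so a
// non-root container can use it. A missing socket file at socketPath turns
// into the "failed to change ownership" error observed in the logs.
func chownSocket(socketPath string, uid, gid int) error {
	if err := os.Chown(socketPath, uid, gid); err != nil {
		return fmt.Errorf("failed to change ownership on socket %q: %w", socketPath, err)
	}
	return nil
}

func main() {
	// Deliberately point at a path that does not exist to show the failure mode.
	if err := chownSocket("/nonexistent/dir/example.sock", os.Getuid(), os.Getgid()); err != nil {
		fmt.Println(err)
	}
}
```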
Impact
This issue seemed to cause the outcome described in the known issues doc where FUSE mount operations hang. I assume this is because the socket is created after the FUSE mount is created but before the file descriptor is passed to the GCS FUSE CSI Driver sidecar:
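For context, here is a minimal sketch of how an already-open file descriptor is typically handed to another process over a unix domain socket using SCM_RIGHTS. This is an assumption about the general mechanism, not the driver's actual code: if the descriptor is never delivered because the socket step failed, the FUSE mount cannot start serving and operations on the volume hang.

```go
// Hedged illustration of fd passing over a unix socket; not the driver's code.
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

// sendFd passes an open file descriptor to the peer on conn as SCM_RIGHTS
// ancillary data; the one-byte payload is just a carrier for the control message.
func sendFd(conn *net.UnixConn, f *os.File) error {
	oob := syscall.UnixRights(int(f.Fd()))
	_, _, err := conn.WriteMsgUnix([]byte{0}, oob, nil)
	return err
}

func main() {
	// Local demonstration: create a connected unix socket pair and send one
	// end of a pipe across it, the same mechanism used to hand off a FUSE fd.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fds[1])

	conn, err := net.FileConn(os.NewFile(uintptr(fds[0]), "sender"))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	if err := sendFd(conn.(*net.UnixConn), r); err != nil {
		panic(err)
	}
	fmt.Println("file descriptor sent over the unix socket")
}
```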
This interacted with a known kubelet issue where Pod cleanup hangs due to an unresponsive volume mount: kubernetes/kubernetes#101622
This then led to all Pod actions stalling on the node: https://github.com/kubernetes/kubernetes/blob/v1.27.0/pkg/kubelet/kubelet.go#L148-L151
Confusingly, the node was not marked as unhealthy when this happened; however, this seems to be due to an unrelated GKE node-problem-detector misconfiguration which I won't give details on here. Unfortunately, since this occurred in a production environment, we needed to manually delete the node to bring the cluster back to a healthy state, so it is no longer around to verify this theory.
This issue has happened twice now on different nodes in the same cluster over the last week.
Note that the kubelet issue seems to have been fixed now, but not in the version of Kubernetes we are using: kubernetes/kubernetes#119968
Environment
GKE version:
v1.27.11-gke.1062004
GCS FUSE version:
v1.4.1-gke.0