[BUG][Opensearch] helm upgrade cause all master pods killed almost simultaneously and breaks the cluster #198
Comments
Hi @deng47, do you have any logs to show whether this is a normal termination of the master nodes or some other error?
Below are logs from one of my new master pods. I noticed that it complained about …
We made some changes between 1.2.1 and 1.2.3 to introduce an environment variable to disable the security plugin. Between 1.2.3 and 1.2.4 we introduced two changes to fix the issue where sed -i opens a different inode (sed -i writes a temporary file and renames it over the original, which changes the inode and therefore requires write permission). Both of these changes need write permission on the opensearch.yml file for OpenSearch. I don't know what your setup is on your k8s cluster @deng47, but can you check your mount permissions on /usr/share/opensearch/config/? Thanks.
@peterzhuamazon Thank you for your quick reply.
In the StatefulSet, the config dir is mounted with 420 permission. Can I change that in values.yaml? If not, it sounds like upgrading to 1.2.4 is the best option.
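For context, 420 is the decimal form of octal 0644, the default mode Kubernetes applies to ConfigMap and Secret volumes. Below is a minimal sketch of where that knob lives in the rendered StatefulSet; the field names are standard Kubernetes, but whether the chart exposes them through values.yaml is exactly the open question here, and the ConfigMap name is hypothetical:

```yaml
# Sketch of the volume stanza Kubernetes renders for the mounted config.
# defaultMode is given in decimal: 420 == 0644 (rw-r--r--), so the files
# are read-only for the opensearch user (uid 1000) unless it owns them.
volumes:
  - name: config
    configMap:
      name: opensearch-config   # hypothetical ConfigMap name
      defaultMode: 420          # 0644; 436 == 0664 would add group write
```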
Is there any chance you could check out 1.2.4 just to see if it runs? I am quite confused about why this happens; I think we probably need to address it in a later version with some smarter logic. Thanks.
I did a …
I created a 1.2.4 cluster (appVersion: 1.2.4; chart version: 1.5.4) from scratch, then updated the content of opensearch.yml …
Is there a way you can cat the content of opensearch.yml after the failure of sed? I have seen a tee failure like this before: opensearch-project/opensearch-build#1529. But I have never seen a sed failure in 1.2.4. Please try creating a 1.2.4 cluster and let me know the results. Thanks.
Oops, you posted it right when I was typing. Can you let me know if you can start a normal cluster with 1.2.4 following this guide, without mounting any specific devices? I was testing on kind and minikube but never saw any issues like in your case.
Are you using a specific user to run the deployment? I suspect the folder's permissions were changed to 660 but the file is still owned by 0:1000 with 644 permissions, which prevents modifying it.
I think we need to figure out why the file's owner is changed to user 0 in the first place; I just verified that on my side it is owned by user 1000.
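One common reason mounted files end up owned by 0:1000 is the pod security context: fsGroup sets the group on volume files, while the owner stays root unless something chowns it. A minimal sketch, assuming the chart passes a standard pod security context through to the StatefulSet (the values here are illustrative, not the chart's confirmed defaults):

```yaml
# Sketch: run the OpenSearch process as uid 1000 and have Kubernetes
# apply gid 1000 to mounted volume files via fsGroup, so the process
# can write files that would otherwise be left as 0:1000 with mode 644.
securityContext:
  runAsUser: 1000   # opensearch user inside the image
  fsGroup: 1000     # group applied to files in mounted volumes
```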
I created …
I think you are right. If I can get the permissions of those two files right, I should be able to fix it.
Tried …
Adding a startupProbe to fix opensearch-project#198 [BUG][Opensearch] helm upgrade cause all master pods killed almost simultaneously and breaks the cluster: when helm upgrades master pods, it kills all old master pods within a few seconds, leaving no time for new master pods to start up and join the cluster, eventually killing the whole cluster. A 30-second startupProbe solves this problem.
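For reference, a minimal sketch of what such a startupProbe could look like in values.yaml; the probe fields follow the standard Kubernetes spec, while the port and thresholds are assumptions rather than the chart's actual defaults:

```yaml
# Sketch: keep each new pod out of the Ready state until OpenSearch
# actually answers on its port, rather than marking it ready after a
# few seconds, so the rolling update cannot delete all old masters at once.
startupProbe:
  tcpSocket:
    port: 9200            # assumed HTTP port; adjust to your setup
  initialDelaySeconds: 30 # the 30-second delay mentioned above
  periodSeconds: 10
  failureThreshold: 30    # give the JVM time to boot and join the cluster
```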
Closing this for now, as we have re-released the 1.2.4 image. Thanks.
Describe the bug
How I deployed my OpenSearch cluster: I pulled the OpenSearch helm chart locally, modified values.yaml, then ran
helm install <name> -f values.yaml --create-namespace -n <name>
I have 3 master pods in my OpenSearch cluster, all with persistence disabled. I updated the content of opensearch.yml in values.yaml, and ran a helm upgrade <name> -f values.yaml --create-namespace -n <name>
Kubernetes recreates all master pods with the new opensearch.yml. It's a rolling upgrade, but a master pod becomes ready in just a few seconds, so in practice Kubernetes kills all old master pods almost simultaneously. Once this happens, the cluster loses all master pods, even though kubectl get pods shows all master pods as up and healthy. securityadmin.sh in the master pod hangs on Contacting opensearch cluster 'opensearch' and wait for YELLOW clusterstate ...
I believe the root cause is that K8s kills the master pods within a short window. I tripled the values of readinessProbe.periodSeconds and readinessProbe.successThreshold in values.yaml, but didn't see K8s wait any extra seconds before killing pods. I found a new feature, minReadySeconds, in Kubernetes v1.23 that may solve my problem; however, I have a v1.21 k8s cluster.
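For reference, a sketch of where minReadySeconds would go once the cluster is on v1.23+; the field is part of the standard StatefulSet spec, and the value shown is an assumption:

```yaml
# Sketch: StatefulSet-level minReadySeconds (available in Kubernetes
# v1.23). The controller waits this long after a pod reports Ready
# before deleting the next pod in the rolling update.
apiVersion: apps/v1
kind: StatefulSet
spec:
  minReadySeconds: 30   # assumed value; tune to the node join time
```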
To Reproduce
Steps to reproduce the behavior:

1. Update opensearch.yml in values.yaml
2. Run helm upgrade with the new values.yaml

Expected behavior
The rolling upgrade should make sure a new master pod is really in a ready state before killing the next pod.
Chart Name
Chart: opensearch
cat Chart.yaml
apiVersion: v2
appVersion: 1.2.3
description: A Helm chart for OpenSearch
maintainers:
name: opensearch
type: application
version: 1.5.4
Host/Environment (please complete the following information):
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+k3s1", GitCommit:"df033fa248bc2c9f636e4c0ff2b782cb8edbbf10", GitTreeState:"clean", BuildDate:"2021-11-04T00:25:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+k3s1", GitCommit:"df033fa248bc2c9f636e4c0ff2b782cb8edbbf10", GitTreeState:"clean", BuildDate:"2021-11-04T00:25:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
helm version
version.BuildInfo{Version:"v3.7.1", GitCommit:"1d11fcb5d3f3bf00dbe6fe31b8412839a96b3dc4", GitTreeState:"clean", GoVersion:"go1.16.9"}