
[BUG] Unable to scale the cluster after upgrading the opensearch operator to 2.6.1 #942

Open
narendraalla opened this issue Jan 16, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@narendraalla

What is the bug?

Unable to scale the cluster after upgrading the OpenSearch operator to 2.6.1.
opensearch version: 2.17.0
The cluster has around 25 hot, 34 warm, 3 master and 4 coordinator nodes. After upgrading the operator and OpenSearch we are unable to scale the pods; below is the error we see in the operator log:
{"level":"error","ts":"2025-01-16T13:30:08.971Z","msg":"Failed to store node certificate in secret","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"Cluster-A-opensearch","namespace":"default"},"namespace":"default","name":"Cluster-A-opensearch","reconcileID":"e42d182f-1065-4b9c-af89-0d69ce424ce6","interface":"transport","error":"updating resource failed: Secret "Cluster-A-opensearch-transport-cert" is invalid: metadata.annotations: Too long: must have at most 262144 bytes","errorVerbose":"Secret "Cluster-A-opensearch-transport-cert" is invalid: metadata.annotations: Too long: must have at most 262144 bytes\nupdating resource failed\ngithub.com/cisco-open/operator-tools/pkg/reconciler.(*GenericResourceReconciler).ReconcileResource\n\t/go/pkg/mod/github.com/cisco-open/[email protected]/pkg/reconciler/resource.go:518\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers/k8s.K8sClientImpl.ReconcileResource\n\t/workspace/pkg/reconcilers/k8s/client.go:198\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers/k8s.K8sClientImpl.CreateSecret\n\t/workspace/pkg/reconcilers/k8s/client.go:73\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).handleTransportGeneratePerNode\n\t/workspace/pkg/reconcilers/tls.go:400\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).handleTransport\n\t/workspace/pkg/reconcilers/tls.go:88\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).Reconcile\n\t/workspace/pkg/reconcilers/tls.go:67\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning\n\t/workspace/controllers/opensearchController.go:328\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).Reconcile\n\t/workspace/controllers/opensearchController.go:143\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695","stacktrace":"github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).handleTransportGeneratePerNode\n\t/workspace/pkg/reconcilers/tls.go:402\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).handleTransport\n\t/workspace/pkg/reconcilers/tls.go:88\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/reconcilers.(*TLSReconciler).Reconcile\n\t/workspace/pkg/reconcilers/tls.go:67\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).reconcilePhaseRunning\n\t/workspace/controllers/opensearchController.go:328\ngithub.com/Opster/opensearch-k8s-operator/opensearch-operator/controllers.(*OpenSearchClusterReconciler).Reconcile\n\t/workspace/controllers/opensearchController.go:143\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226"}

How can one reproduce the bug?

Upgrade the operator, then increase the replicas with `kubectl apply -f config-file.yml`.

What is the expected behavior?

The operator should scale the pods according to the applied settings.

What is your host/environment?

LKE


Do you have any additional context?

At this point we don't know if we can downgrade the operator, as it's a production cluster, and we are unable to scale the cluster to handle the additional load.

@narendraalla narendraalla added bug Something isn't working untriaged Issues that have not yet been triaged labels Jan 16, 2025
@evheniyt
Contributor

I think it's probably because of the `banzaicloud.com/last-applied` annotation that is applied to secrets created by the operator. But if I'm not mistaken, that behavior also existed before 2.6.1.


Because each node has its own certificate and all certificates are stored in the same secret, at some point the `banzaicloud.com/last-applied` annotation becomes too big (more than 256 KiB, the Kubernetes limit).
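As a rough diagnostic, one can measure how close the annotation is to that limit. This is a sketch: the secret name and namespace below are taken from the error log above and may differ in other clusters.

```shell
# Kubernetes rejects objects whose annotations together exceed 262144 bytes
# (totalAnnotationSizeLimitB) -- the limit hit in the error above.
ANNOTATION_LIMIT=262144

annotation_bytes() {
  # Print the byte size of whatever is piped in.
  wc -c | tr -d ' '
}

# Against a live cluster (requires kubectl access; names from the log):
# kubectl -n default get secret Cluster-A-opensearch-transport-cert \
#   -o jsonpath='{.metadata.annotations.banzaicloud\.com/last-applied}' \
#   | annotation_bytes
```

Comparing that number against `ANNOTATION_LIMIT` shows how much headroom remains before the next node certificate pushes the secret over the edge.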

I have checked the code and I don't see how the operator uses this annotation; it probably comes from github.com/cisco-open/operator-tools (see https://github.com/cisco-open/k8s-objectmatcher/blob/17b471df56d7e1a04c07209f78a783df43c13675/patch/annotation.go#L30).

As I don't see a way to omit this annotation in that package, fixing this will probably require changing the design from a single secret to a separate secret per node, or no longer using this package for reconciliation.

@swoehrl-mw
Collaborator

> I think it's probably because of the `banzaicloud.com/last-applied` annotation that is applied to secrets created by the operator. But if I'm not mistaken, that behavior also existed before 2.6.1.

I agree, that's probably the culprit, and it has been part of the operator for quite some time.

> I have checked the code and I don't see how the operator uses this annotation; it probably comes from github.com/cisco-open/operator-tools (see https://github.com/cisco-open/k8s-objectmatcher/blob/17b471df56d7e1a04c07209f78a783df43c13675/patch/annotation.go#L30).

Correct, the operator does not explicitly set this; the library does.

> As I don't see a way to omit this annotation in that package, fixing this will probably require changing the design from a single secret to a separate secret per node, or no longer using this package for reconciliation.

Separate secrets per pod would be ideal, but Kubernetes does not support this for StatefulSets. Getting rid of that library would also require quite a bit of work, as it has convenience features that we would have to reimplement.

@narendraalla As a workaround you could try using one certificate for all nodes (controlled by `spec.security.tls.transport.perNode` in the cluster manifest), but I'm not sure whether this can be rolled out without downtime.
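A minimal sketch of that setting in the cluster manifest, using the names from the error log above. All other fields are elided, the `generate: true` line assumes operator-generated certificates, and whether this change rolls out without downtime is untested:

```yaml
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: Cluster-A-opensearch
  namespace: default
spec:
  security:
    tls:
      transport:
        generate: true   # assumption: certificates are operator-generated
        perNode: false   # one certificate shared by all nodes instead of one per node
```

Applied with `kubectl apply -f config-file.yml`, the same way the replica changes are applied.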

@evheniyt
Contributor

> Separate secrets per pod would be ideal, but Kubernetes does not support this for StatefulSets. Getting rid of that library would also require quite a bit of work, as it has convenience features that we would have to reimplement.

Another option that could improve things slightly is to have a separate secret per node pool.

@narendraalla
Author

@evheniyt @swoehrl-mw Sure, I will check whether I can use a single certificate for all the nodes, or a separate certificate per node pool. I might be wrong, but would it work if we increased the max frame size with `-Djdk.http2.maxFrameSize=1048576`?

@swoehrl-mw
Collaborator

> I might be wrong, but would it work if we increased the max frame size with `-Djdk.http2.maxFrameSize=1048576`?

The size limit is enforced by Kubernetes; changing Java configuration in OpenSearch will have no effect on it.

@prudhvigodithi prudhvigodithi removed the untriaged Issues that have not yet been triaged label Jan 30, 2025