High availability for Redis, Solr and Postgres #36

Open
wants to merge 1 commit into
base: release/4.0.0

gislab-augsburg

Proposed changes:

  • Make number of replicas configurable for Solr statefulset
  • Make number of replicas configurable for Redis statefulset

Background: We recently had an OpenShift update that caused some Solr errors. It was not a big problem, but because Solr needs about 1 minute to start up, being able to scale up the number of pods/containers would help with high availability.

I added Redis for completeness. I see no reason why the Redis replica count is used only for the deployment and not for the statefulset, but please do not hesitate to correct me here.

@BWibo I targeted the branch release/3.0.0 because we use it at the moment. I can change this to the main branch if that is more convenient for you.

@BWibo BWibo changed the title Replica count for solr and redis statefulset High availability for Redis, Solr and Postgres Mar 21, 2024
@BWibo BWibo changed the base branch from release/3.0.0 to devel March 21, 2024 07:22
@BWibo BWibo changed the base branch from devel to release/3.0.0 March 21, 2024 07:22
@gislab-augsburg
Author

Oh, I did not test my changes carefully enough. It seems that 2 Solr pods/containers cannot use the ckan core simultaneously.

The first Solr pod works; the second produces an error:

2024-03-21 15:08:05.287 ERROR (qtp1426435610-18) [] o.a.s.s.HttpSolrCall org.apache.solr.core.SolrCoreInitializationException: SolrCore 'ckan' is not available due to init failure: Index dir '/var/solr/data/ckan/data/index/' of core 'ckan' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native => org.apache.solr.core.SolrCoreInitializationException: SolrCore 'ckan' is not available due to init failure: Index dir '/var/solr/data/ckan/data/index/' of core 'ckan' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:2217)
org.apache.solr.core.SolrCoreInitializationException: SolrCore 'ckan' is not available due to init failure: Index dir '/var/solr/data/ckan/data/index/' of core 'ckan' is already locked. The most likely cause is another Solr server (or another solr core in this server) also configured to use this directory; other possible causes may be specific to lockType: native
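
The lock failure in the log above can be illustrated with a small analogy. This is a minimal sketch, not Solr or Lucene code: it assumes a Unix-like system where `fcntl.flock` is available, and the helper names `try_lock` and `demo` as well as the temporary directory are hypothetical. It shows two holders competing for one exclusive OS-level lock inside the same directory, which is roughly the situation when two pods point at the same index dir with `lockType: native`.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another holder already owns the lock.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: the "already locked" case.
        handle.close()
        return None
    return handle


def demo():
    """Simulate two instances sharing one index directory (illustrative only)."""
    with tempfile.TemporaryDirectory() as index_dir:
        lock_path = os.path.join(index_dir, "write.lock")

        first = try_lock(lock_path)   # "first pod": acquires the lock
        second = try_lock(lock_path)  # "second pod": same directory, refused

        first_ok = first is not None
        second_ok = second is not None

        if first is not None:
            first.close()             # closing the file releases the lock

        third = try_lock(lock_path)   # a later attempt now succeeds
        third_ok = third is not None
        if third is not None:
            third.close()

        return first_ok, second_ok, third_ok


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, only one holder at a time can own the lock, so raising the replica count alone does not give each replica a usable index; each would need its own data directory or a setup designed for replication.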

The second CKAN pod also crashes because it cannot connect to Solr:

/srv/app/start_ckan.sh: Ignoring /srv/app/docker-entrypoint.d/* (not an sh or py file)
Not all environment variables are set. Generating sessions...
[prerun] Start check_db_connection...
[prerun] Start check_solr_connection...
[prerun] Unable to connect to solr...try again in a while.
[prerun] Start check_solr_connection...
[prerun] Unable to connect to solr...try again in a while.
[prerun] Start check_solr_connection...
[prerun] Unable to connect to solr...try again in a while.
[prerun] Start check_solr_connection...
[prerun] Unable to connect to solr...try again in a while.
[prerun] Start check_solr_connection...
[prerun] Unable to connect to solr...try again in a while.
[prerun] Start check_solr_connection...
[prerun] Giving up after 5 tries...
[CKAN prerun] FAILED. Exiting...
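
The retry behaviour visible in this log can be sketched as follows. This is a reconstruction from the log output only, not the actual prerun script: the function name `wait_for_solr`, its parameters, and the injected `check` callable are assumptions made for illustration. It performs one initial attempt plus a fixed number of retries and then gives up.

```python
import time


def wait_for_solr(check, retries=5, delay=0.0, log=print):
    """Call `check()` until it returns True or the retries are used up.

    `check` is any zero-argument callable returning True when Solr is
    reachable (hypothetical; the real connection test is not shown here).
    Returns True on success, False after giving up.
    """
    remaining = retries
    while True:
        log("[prerun] Start check_solr_connection...")
        if check():
            return True
        if remaining == 0:
            log(f"[prerun] Giving up after {retries} tries...")
            return False
        log("[prerun] Unable to connect to solr...try again in a while.")
        remaining -= 1
        time.sleep(delay)


if __name__ == "__main__":
    # A check that never succeeds reproduces the message pattern in the log.
    wait_for_solr(lambda: False)
```

A bounded retry like this only helps when the backend eventually comes up. If the Solr replica behind the service never finishes initialising its core, the check may keep failing and the pod exits as shown.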

@BWibo
Member

BWibo commented Mar 22, 2024

Hehe, yes. High availability is significantly more complex than just raising the number of replicas, at least for stateful applications.

Let's have 3.0.0 released without this. We'll address this issue in another release.

There are HA charts available that we can use. We'll have to check how much effort it takes to integrate them and whether they work with our app.

@BWibo BWibo changed the base branch from release/3.0.0 to release/4.0.0 March 22, 2024 21:57
@BWibo BWibo added effort: 8 type: feature Brand new functionality, features, pages, workflows, endpoints, etc. work: complex The situation is complex, emergent practices used. labels Mar 22, 2024
@BWibo BWibo added this to the v4.0.0 milestone Mar 22, 2024
@BWibo BWibo modified the milestones: v4.0.0, v5.0.0 Apr 21, 2024