
No HA documentation. Is HA supported ? #489

Closed
bernardgut opened this issue Feb 20, 2024 · 9 comments

Comments

@bernardgut

Hello

There is no documentation on how to use this chart to deploy an HA instance of OCIS on Kubernetes. We are currently looking at alternatives, and I think such documentation would really help OCIS's case. Bitnami provides a good example of how to document standalone/HA deployment for a Helm chart in a straightforward manner, in the README. For example: https://github.com/bitnami/charts/tree/main/bitnami/rabbitmq/ (or any other Bitnami chart).

For OCIS, I cannot find any straightforward documentation for:

  • HA: Default HA setup values.yaml (recommended values).
  • HA: Which are the values.yaml parameters that impact HA and possible values.
  • Persistence: Which are the values.yaml parameters that impact persistence and what are the possible values.
  • Requirements for all of the above. (software, HW and storage requirements if any)
  • HA: What level of HA is provided, in straightforward language please (!). That is: if you have a 2-node setup with the above configuration and one node dies, this is the impact... You can alleviate this impact by deploying separate HA services for X, Y and Z and configuring the Helm chart accordingly.

More generally, while writing all of the above, and after 2 weeks of tinkering with OCIS, I just realized that I could not answer the following question:

  • Is HA Supported ?

If not, that is fine, but please let us know by stating it clearly at the top of the README.

Right now the documentation is very verbose about the WHY but says nothing about the WHAT/HOW/WHEN/WHERE. It took me one week to figure out that you have to deploy the message queue separately for HA, because this is not documented anywhere except on some obscure page on the website and in some GitHub issues. Same for storage: no info whatsoever except one comment somewhere in the docs that says you need ReadWriteMany (?) PVs for scaling some services (which, for such a major requirement, should be in the first paragraph of the README).

At this stage, and without clarifications about the above, I cannot recommend OCIS for production deployments in a professional setup. Which is a shame.

Thank you. Sorry for the rant

Cheers
B.

@wkloucek
Contributor

Hi @bernardgut, I raised that topic internally. Right now there is no documentation that I could provide to you.

I can point you to these options:

# -- Number of replicas for each scalable service. Has no effect when `autoscaling.enabled` is set to `true`.
replicas: 1

or

# -- Sets minimum replicas for autoscaling.
minReplicas: 3

and to this issue: #15
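
For readers trying these out, the two options might be combined in a values override roughly as follows. This is only a sketch: the nesting of `minReplicas` under an `autoscaling:` block is inferred from the `autoscaling.enabled` reference in the comment above and is not confirmed in this thread, so check it against the chart's values.yaml before use. The file name is hypothetical.

```yaml
# values-ha.yaml (hypothetical file name)

# Variant A: a fixed replica count for each scalable service.
replicas: 3

# Variant B: autoscaling. `replicas` has no effect once this is enabled.
# ASSUMPTION: `minReplicas` sits under `autoscaling`, next to `enabled`.
autoscaling:
  enabled: false   # set to true to use variant B instead of variant A
  minReplicas: 3
```

Note that, as discussed further down the thread, services that store data also need shared storage before they can run with more than one replica.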

@bernardgut
Author

bernardgut commented Mar 21, 2024

Thank you for the answer @wkloucek. I will try tinkering with these in the testing setup.

Any info on persistence-impacting parameters (or just how persistence is handled by default and how to edit the default behaviour: which parameters, where)?

Best,
B.

@wkloucek
Contributor

> Any info on persistence-impacting parameters (or just how persistence is handled by default and how to edit the default behaviour: which parameters, where)?

Generally, look for each service's persistence: section.

Example:

# -- Persistence settings.
# @default -- see detailed persistence configuration options below
persistence:
  # -- Enables persistence.
  # Needs to be enabled on production installations.
  # If not enabled, pod restarts will lead to data loss.
  # Also scaling this service beyond one instance is not possible if the service instances don't share the same storage.
  enabled: false
  # -- Enables an initContainer to chown the volume.
  # The initContainer is run as root.
  # This is not needed if the driver applies the fsGroup from the securityContext.
  # The image specified in `initContainerImage` will be used for this container.
  chownInitContainer: false
  # -- Storage class to use.
  # Uses the default storage class if not set.
  storageClassName:
  # -- Persistent volume access modes. Needs to be `["ReadWriteMany"]` when scaling this service beyond one instance.
  accessModes:
    - ReadWriteMany
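
Applied to a single service, an override that turns persistence on could look like the sketch below. The `services.storageusers` nesting is an assumption based on the "service's persistence: section" wording above, and the storage class name is a placeholder, so verify both against the chart's values.yaml.

```yaml
# ASSUMPTION: per-service settings are nested under `services.<name>`.
services:
  storageusers:
    persistence:
      # Without this, a pod restart leads to data loss.
      enabled: true
      # Placeholder: replace with a storage class that supports ReadWriteMany,
      # or drop the key to use the cluster's default storage class.
      storageClassName: my-rwx-storage-class
      # Required when scaling the service beyond one instance.
      accessModes:
        - ReadWriteMany
```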

While looking at it I found some wrong information, which will be fixed in #520.

Generally you'll also see a warning when installing / updating the Helm chart. E.g. if you don't have persistence enabled at all, you will see something like this:

#################################################################################
######   WARNING: Persistence is disabled for some services.                #####
######   You will lose your data when a service's pod is terminated.        #####
######                                                                      #####
######   Following services don't use persistence:                          #####
######     - storageusers                                                   #####
######     - storagesystem                                                  #####
######     - web                                                            #####
######     - idm                                                            #####
######     - search                                                         #####
######     - nats                                                           #####
#################################################################################

@bernardgut
Author

I saw the fixes in #520. This clears up a lot of the doubts about HA we had when we first tested this. I will try to play with this again as soon as I have some free time and report back on HA deployment results here.

Thanks

@wkloucek
Contributor

Closing for now. We can reopen if there are more questions.

@bernardgut
Author

bernardgut commented Dec 9, 2024

Hello @wkloucek

I am giving this a second go this weekend and this week. I see there were a few improvements to the chart when it comes to HA, which is great! I have a few feedback points that I will share at the end of the testing round, but for now I am blocked, and the question is the following:

  • Can I set storagesystem.persistence.enabled=false if storageusers.driver=s3ng AND store.type=nats-js-kv|redis-sentinel? According to the docs, storage-system is basically a cache for metadata/user-data.
  • Same question for storageusers.persistence.enabled, which, if I understand properly, is the storage layer for user data.

I am confused as to why you would need a ReadWriteMany volume on top of Redis and block storage for these two services: they have access to stateful data storage backends but still need ReadWriteMany (at least that is what the chart says)? That is basically an antipattern for Kubernetes production deployments.

Also on a side note (less important):

Why not set up a Helm repository with versioned releases? If you want, I can help you set up a CI pipeline where pushing a change to the chart version (version: 0.7.0) automatically triggers a corresponding release using GH Actions and makes it available at the integrated Helm repository for your org, https://owncloud.github.io/ocis. Users can then install this chart with the classic helm repo add ocis https://owncloud.github.io/ocis and helm install test-release ocis/ocis... Let me know if you're interested.
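
A pipeline like the one proposed here could be sketched with GitHub Actions roughly as follows. It assumes the community `helm/chart-releaser-action`, a default branch named `main`, charts stored under `charts/`, and GitHub Pages served from a `gh-pages` branch; none of these are confirmed for this repository, and the workflow file name is hypothetical.

```yaml
# .github/workflows/release.yaml (hypothetical file name)
name: Release Charts

on:
  push:
    branches:
      - main   # ASSUMPTION: default branch name

jobs:
  release:
    permissions:
      contents: write   # needed to create releases and push the repo index
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so changed charts can be detected

      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"

      # Packages charts changed since the last release tag, creates a
      # GitHub release for each, and updates index.yaml on gh-pages.
      - name: Run chart-releaser
        uses: helm/chart-releaser-action@v1.6.0
        env:
          CR_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
```

With a setup like this, a release is only published when the `version:` field in Chart.yaml moves to a value that has not been released yet, which matches the "bump the version to release" flow described above.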

Thanks

Cheers
Bernard.

@wkloucek
Contributor

> According to the docs, storage-system is basically a cache for metadata/user-data.

It actually stores stuff: sharing data, user settings, ... It has caching mechanisms too, but it always needs persistence via a RWX volume (RWO is sufficient if you don't need to scale it to multiple replicas and can use a Recreate rollout strategy).

> I am confused as to why you would need a ReadWriteMany volume on top of Redis and block storage for these two services: they have access to stateful data storage backends but still need ReadWriteMany (at least that is what the chart says)? That is basically an antipattern for Kubernetes production deployments.

storageusers, storagesystem (and the ocm service, if you use OCM) need RWX volumes because this is where the metadata is stored for the s3ng storage driver and where metadata + blobs are stored for the ocis storage driver.

NATS / Redis-Sentinel is actually used as cache and store like you described, but NOT for files. Maybe there's gonna be another storage driver in the future that can live without RWX for storing metadata and leveraging one of those two key-value stores... (but that would be the oCIS product that needs to d

> Why not set up a Helm repository with versioned releases? If you want, I can help you set up a CI pipeline where pushing a change to the chart version [...]
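
Put as configuration, the requirement described above might be sketched like this. The `services.<name>` nesting is an assumption and should be checked against the chart's values.yaml; the access modes follow the explanation in this comment.

```yaml
# ASSUMPTION: per-service settings are nested under `services.<name>`.
services:
  storagesystem:
    persistence:
      enabled: true
      accessModes:
        # Needed once the service runs with more than one replica.
        # With a single replica and a Recreate rollout strategy,
        # ReadWriteOnce would be sufficient instead.
        - ReadWriteMany
  storageusers:
    persistence:
      enabled: true
      accessModes:
        # Holds metadata for the s3ng driver, or metadata + blobs
        # for the ocis driver, so it cannot be disabled either.
        - ReadWriteMany
```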

#611 (comment) still applies

@bernardgut
Author

Very well, thanks for the quick answer. I will deploy in standalone (SA) mode for now and keep testing.

> NATS / Redis-Sentinel is actually used as cache and store like you described, but NOT for files. Maybe there's gonna be another storage driver in the future that can live without RWX for storing metadata and leveraging one of those two key-value stores... (but that would be the oCIS product that needs to d

Is there an issue somewhere tracking this on the OCIS repo (that you are aware of)? I feel like this is a pretty critical requirement for prod deployments on Kubernetes.

(Basically, most ReadWriteMany implementations on Kubernetes are a variation of NFS, and there are many reasons not to want to use NFS for storing metadata/user-data like you described.)

If not, should I create one? I might even help if I end up adopting this.

@wkloucek
Contributor

> Is there an issue somewhere tracking this on the OCIS repo (that you are aware of)?

The product roadmap has three storage related items: https://github.com/orgs/owncloud/projects/344/views/1?filterQuery=-quarter%3A%22Q4+%2F+2023%22%2C%22Q1+%2F+2024%22++storage

Otherwise there is:
