
Scale faucet #278

Open
mutantcornholio opened this issue May 26, 2023 · 12 comments
@mutantcornholio
Contributor

Currently, failed deployments lead to outages.
Let's run two instances of each faucet, so a failed deployment leads to a stuck deploy rather than downtime.
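
For illustration, a minimal sketch of what that could look like in the Deployment spec (the names, image, and values are placeholders, not the actual chart): two replicas with a rolling update that never removes a healthy pod before its replacement is Ready, so a broken image leaves the rollout stuck while the old pods keep serving.

```yaml
# Hypothetical excerpt of a faucet Deployment manifest (not the real chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: faucet
spec:
  replicas: 2                  # two instances per faucet
  selector:
    matchLabels:
      app: faucet
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep the old pod until its replacement is Ready
      maxSurge: 1              # roll out one new pod at a time
  template:
    metadata:
      labels:
        app: faucet
    spec:
      containers:
        - name: faucet
          image: paritytech/faucet:latest   # placeholder image
```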

@mutantcornholio mutantcornholio self-assigned this May 26, 2023
@mordamax
Contributor

@mutantcornholio could you link to pipelines or some logs or provide log examples?

Is it that when you deploy and the app can't start, it 1. fails the CI job and 2. still ends up with the broken code deployed?

@PierreBesson
Contributor

The problem is that if you "scale" the faucet to 2 instances then there will be two processes listening to messages on Matrix and so drips will be produced twice.

@mutantcornholio
Contributor Author

The problem is that if you "scale" the faucet to 2 instances then there will be two processes listening to messages on Matrix and so drips will be produced twice.

Yes, that obviously needs to be dealt with.

@mutantcornholio
Contributor Author

I'm probably more inclined to do it the same way gitspiegel works: upon receiving a message, write its id to a shared database; if another instance has beaten us to it, ignore the message (see the sketch below).

@paritytech/opstooling WDYT?
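
A minimal sketch of that first-writer-wins idea, assuming a shared Postgres-like store; the table, column, and function names are illustrative, not the faucet's actual schema:

```ts
// Hypothetical first-writer-wins claim on a shared table.
// Illustrative schema: CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Returns true if this instance claimed the Matrix event,
// false if another instance already recorded the same event id.
async function claimEvent(eventId: string): Promise<boolean> {
  const res = await pool.query(
    "INSERT INTO processed_events (event_id) VALUES ($1) ON CONFLICT (event_id) DO NOTHING",
    [eventId],
  );
  return res.rowCount === 1;
}

async function onMatrixMessage(eventId: string, sendDrip: () => Promise<void>) {
  if (!(await claimEvent(eventId))) return; // another instance got there first
  await sendDrip();
}
```

The primary key on the event id is what makes the race safe: both instances can try to insert, but only one row lands, and only that instance drips.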

@chevdor
Contributor

chevdor commented May 30, 2023

For this use case, it sounds much more appropriate to use a queue such as RabbitMQ (just to throw a name).

You'd need one (or more) listeners that add the "job" to the queue. If you use several listeners, you want to make sure a key is used to prevent duplicates.

With that, it becomes much easier to have as many "workers" as you wish (i.e. k8s deployments) that pick up the tasks and remove them from the queue once they are successfully done. If a worker fails, the entry remains in the queue and can be picked up by the next worker.

@mutantcornholio
Contributor Author

mutantcornholio commented May 30, 2023

The cost of splitting the instance into master / worker wouldn't be worth it IMO.

The goal is to have the minimum redundancy that allows maintenance without downtime.
The actual load here is laughable and unlikely to require horizontal scaling in the foreseeable future.

If we go with splitting the instance, we'd end up with four instances for every network, while we'd do perfectly fine with two.
Requiring two instances for local development is also a downside.

We could still go with a job queue, but have both the producer and the consumer in the same instance. That would be basically the same as what I suggested, except now we'd get things like retries and timeouts for free.
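
A sketch of that in-process variant, using BullMQ purely as an example of a Redis-backed queue (chevdor mentioned RabbitMQ; the queue name, payload, and retry settings below are assumptions): the same process enqueues a job per Matrix event and consumes it, and using the event id as the jobId gives the key-based dedup mentioned above.

```ts
import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 }; // assumed Redis location

// Producer side: called from the Matrix listener in the same process.
const dripQueue = new Queue("drips", { connection });

async function enqueueDrip(eventId: string, address: string) {
  // Using the Matrix event id as jobId makes a second add() of the same event a no-op,
  // so two instances can both listen to Matrix without double-dripping.
  await dripQueue.add(
    "drip",
    { address },
    { jobId: eventId, attempts: 3, backoff: { type: "exponential", delay: 5_000 } },
  );
}

// Consumer side, in the same process: failed jobs are retried per the attempts/backoff above.
new Worker(
  "drips",
  async (job) => {
    await sendDrip(job.data.address); // hypothetical function submitting the on-chain transfer
  },
  { connection },
);

async function sendDrip(address: string) {
  console.log(`dripping to ${address}`); // placeholder for the actual drip logic
}
```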

@mordamax
Contributor

mordamax commented May 30, 2023

I'm a bit confused: why wasn't this a problem before?
AFAIR, when we used to deploy through helm, it would spin up a new instance and, only if it started OK, replace the old instance with the new one. If I tried to deploy broken code or configs, it would just fail at the CI level and prod wouldn't be affected.
Is it working differently via ArgoCD?

If yes, are there other ways to solve it rather than having 2 instances and introducing a DB etc... ?

That sounds like overkill to me for a problem of wrong configuration or something similar.

@mutantcornholio
Contributor Author

AFAIR, when we used to deploy through helm, it would spin up a new instance and, only if it started OK, replace the old instance with the new one. If I tried to deploy broken code or configs, it would just fail at the CI level and prod wouldn't be affected.
Is it working differently via ArgoCD?
If yes, are there other ways to solve it rather than having 2 instances and introducing a DB etc... ?

Feels like it always worked like that and nobody cared.

I don't think any deployment configuration can get around the problem of two instances listening to the same Matrix events and producing duplicate drips as a result.
"Replacing" the instance implies switching the backend behind a load balancer; since the instances pull their load themselves, that approach won't work here.

@chevdor
Contributor

chevdor commented May 30, 2023

Feels like it always worked like that and nobody cared.

I think we have been lucky so far.
The helm chart uses Deployments and allows an arbitrary number of replicas using rolling updates.
All is good as long as we use only one replica.

@mutantcornholio
Contributor Author

I also realised that the faucet stores its drips in a local, non-persistent (!) sqlite database.
It needs them to check the daily/hourly quotas. That also needs to be addressed.

However, it's all simple stuff, isn't it?
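
For context, the kind of check those drip records feed looks roughly like this (the file name, table, and columns are guesses for illustration, not the faucet's real schema), which is why they need to survive restarts:

```ts
// Hypothetical hourly-quota check against the local sqlite drip log.
import Database from "better-sqlite3";

const db = new Database("drips.sqlite"); // assumed path; currently on a non-persistent volume

// Count how many drips an address received in the last hour.
function dripsInLastHour(address: string): number {
  const row = db
    .prepare(
      "SELECT COUNT(*) AS n FROM drips WHERE address = ? AND timestamp > datetime('now', '-1 hour')",
    )
    .get(address) as { n: number };
  return row.n;
}

function underHourlyQuota(address: string, limit = 1): boolean {
  return dripsInLastHour(address) < limit;
}
```

If that sqlite file lives on an ephemeral volume, every redeploy resets the counts, so the data needs either a persistent volume or the shared store discussed above.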

@mordamax
Contributor

I can't find the logs anymore, unfortunately (it'd be great to save a snapshot in text format next time),
so I'm not sure that scaling to 2 instances is the right way to deal with "failed deployments".
Do I understand it right that if we set up the livenessProbe & readinessProbe properly, the deployment should be rolled back if they don't pass? Wouldn't that be the proper fix for the deployment problem?
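
For reference, a sketch of what that probe setup could look like in the container spec (the /health path, port, and timings are placeholders, not the faucet's actual endpoints). With a readinessProbe in place, a broken pod never becomes Ready, so a rolling update stalls instead of serving broken code; an actual rollback would still need `kubectl rollout undo` or equivalent.

```yaml
# Hypothetical probe configuration for the faucet container
containers:
  - name: faucet
    image: paritytech/faucet:latest   # placeholder image
    livenessProbe:
      httpGet:
        path: /health
        port: 5555                    # placeholder port
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health
        port: 5555
      periodSeconds: 5
```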
