
Remove graceful shutdown #29

Open
kalabukdima opened this issue Jan 6, 2025 · 6 comments

Comments

@kalabukdima (Collaborator)

Inspired by https://lwn.net/Articles/191059/
The worker is basically stateless, so removing the cancellation token will only simplify things.
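For illustration, a crash-only design (in the spirit of the LWN article above) collapses shutdown into the failure path that callers must already handle. A minimal Python sketch with hypothetical names, not the actual worker code:

```python
import os
import signal

def handle_request(payload: bytes) -> bytes:
    # Stateless handler: each request is self-contained, so a request
    # lost to an abrupt exit is indistinguishable (to the caller) from
    # an ordinary network failure it must already tolerate.
    return payload.upper()

def shutdown() -> None:
    # Crash-only: no cancellation token threaded through the handlers,
    # no drain phase. The only shutdown path is the one a crash takes,
    # so recovery is just normal startup.
    os.kill(os.getpid(), signal.SIGKILL)
```

The point is that `handle_request` has exactly one code path, exercised both in normal operation and during shutdown.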

@mo4islona commented Jan 6, 2025

In my opinion, this is a bad idea for any program with a request/response model, like an HTTP server.
Portals may get 500 errors (or the equivalent) for no reason.

@kalabukdima (Collaborator, Author)

They can get such errors at any time, so they already have to handle them. Falling back on that handling replaces two execution paths with a single one, making it simpler to test. And trying to make such faults less frequent may hide bugs in the fault-handling procedures on the portal side.

@mo4islona commented Feb 17, 2025

  1. Programs are not written for the purity of their code or the purity of their architecture; programs are written to be used.
  2. Although our programs are free, we use them for commercial purposes.
  3. The two statements above clearly mean that our programs should be focused primarily on the end user, not on the programmer who writes them.

A program that produces errors by design (and, of course, sometimes because it is broken) is, from the end user's point of view, buggy by design, which is inconsistent with point 3.

From the point of view of support and observability by other teams and departments in the company, additional handling is needed for the situations where the program produces these "designed" errors.

Which means that the programmer, having simplified the work for himself, has complicated it elsewhere and worsened the experience for the end user.

To me, this looks unacceptable.

@mo4islona commented Feb 17, 2025

In this particular case, can we identify these errors separately (e.g., a dedicated 5xx status meaning "shutting down") so that the portal can also categorize them separately?

This can be a good option that does not compromise the user experience.

@kalabukdima (Collaborator, Author)

Wait, the point has never been about simplifying the work for the programmer: the graceful shutdown is already implemented, so removing it takes work.

The main point is that we're building a distributed system: a system composed of many components. Each component is unstable (it may fail); we don't even have control over the workers, yet the goal is to provide a stable system as a whole. Such a system inevitably has to deal with worker (or other component) failures without bothering the end user. So it's not possible to guarantee that the portal will never have failed queries: the mere fact that the components communicate over a network already implies a certain rate of errors.

Getting back to the workers: out of the total error rate E, there is some (unknown) rate S caused by the worker being interrupted. With graceful shutdown, the portals don't see those errors, so they don't have to handle them. But if S/E gets close to 1, we still have to handle the remaining errors, yet we may forget to do so, e.g., during a code refactoring. That becomes a ticking time bomb, because such an error may only surface after a release, having gone unnoticed during the testing period.
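To make the S/E argument concrete with a back-of-the-envelope calculation (all numbers purely illustrative, not measured):

```python
# Hypothetical rates, purely illustrative.
E = 1e-4          # total error rate per query, as seen by a portal
S_over_E = 0.99   # fraction of those caused by interrupted workers

queries_per_day = 1_000_000

# With graceful shutdown, portals only exercise the residual error path:
errors_seen_graceful = queries_per_day * E * (1 - S_over_E)   # ~1 per day
# Crash-only keeps the single error path exercised constantly:
errors_seen_crash_only = queries_per_day * E                  # ~100 per day
```

At roughly one visible error per day, a regression in the portal's error handling can easily slip through a testing window; at a hundred per day, it is caught almost immediately.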

So, despite being controversial, the idea of "forcing failure of what is supposed to fail" may actually make the entire system more robust without the need for a separate crash-testing process. You may argue that we'd be better off with a crash-testing process instead, and I would agree, but then it really does come down to development time. How many companies can afford something like Netflix's Chaos Monkey? The end goal is to make the system more robust, not to save on development time.

@rmcmk commented Feb 26, 2025

Just a user's two cents; apologies if this isn't the right place for it. I agree with @kalabukdima.

In a distributed system, failures are inevitable, and handling them consistently makes the system more robust. Obfuscating failure cases doesn't eliminate them; it just shifts the problem elsewhere.

You can still improve UX with better resiliency features (circuit breakers, load shedding where possible, back-off, etc.) and enhanced metrics, which are arguably far more valuable than a clean boot loop.
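One of those resiliency features, retry with capped exponential back-off and full jitter, can be sketched as follows (names and parameters hypothetical; a real portal would also track per-worker failure counts):

```python
import random
import time

def retry(call, attempts: int = 4, base: float = 0.1, cap: float = 2.0,
          sleep=time.sleep):
    """Retry `call` on ConnectionError with capped exponential back-off."""
    for i in range(attempts):
        try:
            return call()
        except ConnectionError:
            if i == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random fraction of the capped delay.
            sleep(min(cap, base * (2 ** i)) * random.random())
```

With this in place on the calling side, a query that hits a restarting worker is simply retried (ideally against a different worker), and whether the worker shut down gracefully or crashed becomes invisible to the end user.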
