
Remove graceful shutdown #29

Open
kalabukdima opened this issue Jan 6, 2025 · 6 comments

Comments

@kalabukdima (Collaborator)

Inspired by https://lwn.net/Articles/191059/
The worker is basically stateless, so removing the cancellation token will only simplify things.
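For illustration, a crash-only design (in the spirit of the LWN article above) collapses shutdown into the failure path that callers must already handle. A minimal Python sketch with hypothetical names, not the actual worker code:

```python
import os
import signal

def handle_request(payload: bytes) -> bytes:
    # Stateless handler: each request is self-contained, so a request
    # lost to an abrupt exit is indistinguishable (to the caller) from
    # an ordinary network failure it must already tolerate.
    return payload.upper()

def shutdown() -> None:
    # Crash-only: no cancellation token threaded through the handlers,
    # no drain phase. The only shutdown path is the one a crash takes,
    # so recovery is just normal startup.
    os.kill(os.getpid(), signal.SIGKILL)
```

The point is that `handle_request` has exactly one code path, exercised both in normal operation and during shutdown.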

@mo4islona commented Jan 6, 2025

In my opinion, this is a bad idea for any program with a request/response model, like an HTTP server.
Portals may get 500 errors (or the equivalent) for no reason.

@kalabukdima (Collaborator, Author)

They can get such errors at any time, so they already have to handle them. Falling back on that handling replaces two execution paths with a single one, making it simpler to test. And trying to make such faults less frequent may hide bugs in the fault-handling procedures on the portal side.

@mo4islona commented Feb 17, 2025

  1. Programs are not written for the purity of their code or the purity of their architecture; programs are written to be used.
  2. Although our programs are free, we use them for commercial purposes.
  3. The two statements above clearly mean that our programs should be focused primarily on the end user, not on the programmer who writes them.

A program that produces errors by design (and, of course, sometimes because it is broken) is, from the end user's point of view, buggy by design, which is inconsistent with point 3.

From the point of view of support and observability by other teams and departments in the company, additional handling is needed for the situations where the program produces these "designed" errors.

Which means that the programmer, having simplified the work for himself, has complicated it elsewhere and worsened the experience for the end user.

To me, this looks unacceptable.

@mo4islona commented Feb 17, 2025

In this particular case, can we identify these errors separately (e.g., a dedicated 5xx status meaning "shutting down") so that the portal can also categorize them separately?

This can be a good option that does not compromise the user experience.

@kalabukdima (Collaborator, Author)

Wait, the point has never been about simplifying the work for the programmer: the graceful shutdown is already implemented, so removing it takes work.

The main point is that we're building a distributed system: a system composed of many components. Each component is unstable (it may fail); we don't even have control over the workers, yet the goal is to provide a stable system as a whole. Such a system inevitably has to deal with worker (or other component) failures without bothering the end user. So it's not possible to guarantee that the portal will never have failed queries: the mere fact that the components communicate over a network already implies a certain rate of errors.

Getting back to the workers: out of the total error rate E, there is some (unknown) rate S caused by the worker being interrupted. With graceful shutdown, the portals don't see those errors, so they don't have to handle them. But if S/E gets close to 1, we still have to handle the remaining errors, yet we may forget to do so, e.g., during a code refactoring. That becomes a ticking time bomb, because such an error may only surface after a release, having gone unnoticed during the testing period.
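To make the S/E argument concrete with a back-of-the-envelope calculation (all numbers purely illustrative, not measured):

```python
# Hypothetical rates, purely illustrative.
E = 1e-4          # total error rate per query, as seen by a portal
S_over_E = 0.99   # fraction of those caused by interrupted workers

queries_per_day = 1_000_000

# With graceful shutdown, portals only exercise the residual error path:
errors_seen_graceful = queries_per_day * E * (1 - S_over_E)   # ~1 per day
# Crash-only keeps the single error path exercised constantly:
errors_seen_crash_only = queries_per_day * E                  # ~100 per day
```

At roughly one visible error per day, a regression in the portal's error handling can easily slip through a testing window; at a hundred per day, it is caught almost immediately.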

So, despite being controversial, the idea of "forcing failure of what is supposed to fail" may actually make the entire system more robust without the need for a separate crash-testing process. You may argue that we'd be better off with a crash-testing process instead, and I would agree, but then it really does come down to development time. How many companies can afford something like Netflix's Chaos Monkey? The end goal is to make the system more robust, not to save on development time.

@rmcmk commented Feb 26, 2025

Just a user's two cents; apologies if this isn't the right place for it. I agree with @kalabukdima.

In a distributed system, failures are inevitable, and handling them consistently makes the system more robust. Obfuscating failure cases doesn't eliminate them; it just shifts the problem elsewhere.

You can still improve UX with better resiliency features (circuit breakers, load shedding where possible, back-off, etc.) and enhanced metrics, which are arguably far more valuable than a clean boot loop.
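One of those resiliency features, retry with capped exponential back-off and full jitter, can be sketched as follows (names and parameters hypothetical; a real portal would also track per-worker failure counts):

```python
import random
import time

def retry(call, attempts: int = 4, base: float = 0.1, cap: float = 2.0,
          sleep=time.sleep):
    """Retry `call` on ConnectionError with capped exponential back-off."""
    for i in range(attempts):
        try:
            return call()
        except ConnectionError:
            if i == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random fraction of the capped delay.
            sleep(min(cap, base * (2 ** i)) * random.random())
```

With this in place on the calling side, a query that hits a restarting worker is simply retried (ideally against a different worker), and whether the worker shut down gracefully or crashed becomes invisible to the end user.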
