Remove graceful shutdown #29
In my opinion, this is a bad idea for any program with a request/response model, like an HTTP server.
They can get such errors at any time, so they already have to handle them. Falling back to this handling replaces two paths of execution with a single one, making it simpler to test. And trying to make such faults less frequent may hide bugs in the fault-handling procedures on the portal side.
A program that produces errors by design (and, of course, sometimes because it is actually broken) is, from the end user's point of view, a buggy program by design, which is not consistent with 3). From the point of view of support and observability by other teams and/or departments in the company, additional handling is needed for the situations where the program produces these "designed" errors. Which means that the programmer, having simplified the work for himself, has complicated it elsewhere and made the program look worse to the end user. To me this looks unacceptable.
In this particular case, can we identify these errors separately (i.e. with a special error code)? This can be a good option that does not compromise the user experience.
Wait, the point has never been about simplifying the work for the programmer — the graceful shutdown is already implemented, so it takes work to remove it. The main point is that we're building a distributed system — a system composed of many components. Each component is unstable (it may fail) — we don't even have control over the workers — but the goal is to provide a stable system as a whole. Such a system inevitably has to deal with worker (or other components') failures without bothering the end user. So it's not possible to guarantee that the portal will never see failed queries — the mere fact that the components communicate over the network already implies a certain rate of errors.

Getting back to the workers: out of the total error rate E, there is some (unknown) rate S caused by the worker being interrupted. With graceful shutdown, the portals don't get those errors, so they don't have to handle them. But if S/E gets close to 1, we still have to handle the rest of the errors, yet we may forget to do so, e.g., during some code refactoring. Then it becomes a ticking time bomb, because such an error may only surface after the release, having gone unnoticed during the testing period.

So despite being controversial, the idea of "forcing failure of what is supposed to fail" may actually make the entire system more robust without the need for a separate crash-testing process. You may argue that we should rather have a crash-testing process instead, and I would agree, but then it does come down to development time. How many companies can afford something like Netflix's Chaos Monkey? But the end goal is to make the system more robust, not to save on development time.
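To make the "single path" argument concrete, here is a minimal portal-side sketch (all names are hypothetical, nothing here is taken from the actual portal code): any failed query, whether the worker was interrupted or the network simply dropped, falls back to another worker through the same code path, so that path keeps being exercised even when S/E gets close to 1.

```rust
/// Hypothetical portal-side helper: try the query against candidate workers
/// and fall back to the next one on *any* error. There is deliberately no
/// special case for "the worker was shut down", so the fallback logic is
/// exercised by the common case and cannot silently rot.
async fn query_with_fallback<W, T, E, Fut>(
    workers: &[W],
    mut query: impl FnMut(&W) -> Fut,
) -> Result<T, E>
where
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for worker in workers {
        match query(worker).await {
            Ok(result) => return Ok(result),
            // An interrupted worker is handled exactly like any other failure.
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.expect("the workers list must not be empty"))
}
```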
Just a user's two cents—apologies if this isn't the right place for it. I agree with @kalabukdima: in a distributed system, failures are inevitable, and handling them consistently makes the system more robust. Obfuscating failure cases doesn't eliminate them—it just shifts the problem elsewhere. You can still improve UX with better resiliency features (circuit breakers where possible, load shedding, back-off, etc.) and enhanced metrics, which are arguably far more valuable than a clean boot loop.
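As a rough illustration (a sketch only, with made-up thresholds, not a proposal for this codebase), a portal-side circuit breaker can be as small as this: after a few consecutive failures from a worker, stop routing queries to it for a cooldown period instead of failing user requests against it.

```rust
use std::time::{Duration, Instant};

/// Minimal circuit-breaker sketch; the field names and thresholds are illustrative.
struct CircuitBreaker {
    consecutive_failures: u32,
    threshold: u32,
    open_until: Option<Instant>,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { consecutive_failures: 0, threshold, open_until: None, cooldown }
    }

    /// Returns false while the breaker is open, i.e. while requests to this
    /// worker should be shed.
    fn allow_request(&mut self) -> bool {
        match self.open_until {
            Some(until) if Instant::now() < until => false,
            _ => {
                self.open_until = None;
                true
            }
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.threshold {
            // Too many failures in a row: stop sending queries to this worker
            // for a cooldown period instead of failing every user request.
            self.open_until = Some(Instant::now() + self.cooldown);
            self.consecutive_failures = 0;
        }
    }
}
```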
Inspired by https://lwn.net/Articles/191059/
The worker is basically stateless, so removing the cancellation token will only simplify things.
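For context, a rough sketch of what the main loop looks like once the token is gone (`next_request` and `handle` are made-up stand-ins for the real request source and query handler, not names from this repository):

```rust
// Stand-in for receiving the next query from the network.
async fn next_request() -> String {
    tokio::time::sleep(std::time::Duration::from_millis(10)).await;
    "query".to_string()
}

// Stand-in for executing the query and sending the response.
async fn handle(_req: String) {}

#[tokio::main]
async fn main() {
    // No CancellationToken, no `select!` against a shutdown signal, no final
    // state flush. Since the worker keeps no state worth saving, being killed
    // mid-request is just one more failure the portal already has to handle.
    loop {
        let req = next_request().await;
        handle(req).await;
    }
}
```

Killing the process mid-request then looks to the portal like any other failed query, which is exactly the case it already has to retry.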