Overload Protection #110
Comments
What about trying to rip out some redbug functionality for safer tracing?
@ruanpienaar I was considering it, but concluded it would be too big of a change to do that and still maintain all of the features we already have.
The trace limit could also be based on time. For example, for a webserver getting thousands of requests I might want to trace for only 20 seconds. Maybe it could be set globally or per tracer?
@shezarkhani Would it still be vulnerable to overload in those 20 seconds?
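A wall-clock limit like the 20 seconds suggested above could be sketched with the standard `timer` module. The module name `trace_timer` and function `stop_after/1` are hypothetical illustrations, not existing erlyberly code:

```erlang
%% Hypothetical sketch: stop all tracing after a fixed time limit.
%% timer:apply_after/4 schedules dbg:stop_clear/0 to run once the
%% limit (in milliseconds) has elapsed.
-module(trace_timer).
-export([stop_after/1]).

stop_after(Millis) when is_integer(Millis), Millis > 0 ->
    {ok, _TRef} = timer:apply_after(Millis, dbg, stop_clear, []).
```

For the 20-second example: `trace_timer:stop_after(20000).` Note this only bounds the duration; as the reply above points out, it does nothing against a burst of traces inside the window.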
A lot of this is implemented in #112:
We still need a mode in which the process on the target node does minimal processing, e.g. no record field highlighting or stack traces. Some way of handling very large binaries or strings would also make it much safer to handle many more messages. This probably means persisting large binaries to disk and showing only the first 100 bytes in the summary; the rest would be shown in the hex viewer. Something would need to be done to search these files when filtering.
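The large-binary handling could look something like the following sketch. The module name, the directory argument, and the exact placement of the 100-byte cutoff are assumptions for illustration:

```erlang
%% Sketch: persist large binaries to disk and keep only the first
%% 100 bytes for the table summary. The hex viewer would later load
%% the full file from disk. Names here are illustrative.
-module(trace_binary).
-export([summarise/2]).

-define(SUMMARY_BYTES, 100).

summarise(Bin, Dir) when is_binary(Bin), byte_size(Bin) > ?SUMMARY_BYTES ->
    %% Write the full binary out so nothing is lost, then truncate.
    FileName = filename:join(Dir,
                   integer_to_list(erlang:unique_integer([positive]))),
    ok = file:write_file(FileName, Bin),
    {truncated, binary:part(Bin, 0, ?SUMMARY_BYTES), FileName};
summarise(Bin, _Dir) when is_binary(Bin) ->
    %% Small binaries pass through untouched.
    {full, Bin}.
```

Filtering would then need a second pass that searches the persisted files, which is the open problem mentioned above.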
erlyberly should be safe for production, or at least as safe as we can make it.
Here are some thoughts on one possible solution. Comments welcome.
The UI crashes when too many traces are sent at once
I suspect there are some wasteful methods, converting OTP terms to strings to be shown in the table. This could be improved a lot with profiling.
Need to investigate how jinterface can handle overload. May need to modify jinterface.
The UI process can OOM because it does not remove traces unless commanded by the user.
The UI should also limit the number of traces that it can hold to prevent a Java OOM, for example a maximum of 1000 trace logs, deleting the oldest when another comes in.
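The bounded buffer itself would live on the Java UI side, but the evict-oldest policy can be sketched with Erlang's `queue` module. The module name and API are illustrative; only the 1000 cap comes from the text:

```erlang
%% Sketch of a bounded trace log: cap at 1000 entries and drop the
%% oldest entry when a new trace arrives at capacity.
-module(bounded_log).
-export([new/0, add/2]).

-define(MAX_TRACES, 1000).

new() -> {0, queue:new()}.

add(Trace, {Len, Q}) when Len < ?MAX_TRACES ->
    {Len + 1, queue:in(Trace, Q)};
add(Trace, {Len, Q0}) ->
    %% At capacity: evict the oldest trace before appending the new one.
    {_, Q1} = queue:out(Q0),
    {Len, queue:in(Trace, Q1)}.
```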
The collector process's message queue could grow if it cannot handle the number of traces
We could also have a separate pid that monitors the number of reductions the collector process takes and stops dbg if it is too high. I don't know if that is useful; we should try the message queue approach first.
The easiest thing I can think of is for the collector process to check its own message queue size after it has sent a message and, if it is too large, stop tracing and flush the queue immediately. Once tracing is stopped and the queue flushed, it would need to send a special message to the UI saying that tracing has been suspended.
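A minimal sketch of that self-check, assuming a threshold of 1000 messages and a made-up `traces_suspended` message tag (neither is the actual erlyberly protocol):

```erlang
%% Sketch: after forwarding a trace, the collector inspects its own
%% message queue length; if it has fallen too far behind it stops all
%% tracing, drops the backlog and tells the UI tracing was suspended.
-module(collector_guard).
-export([check_overload/1]).

-define(MAX_QUEUE_LEN, 1000).  %% assumed threshold, not tuned

check_overload(UiPid) ->
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    case Len > ?MAX_QUEUE_LEN of
        true ->
            dbg:stop_clear(),                         %% stop all tracing
            flush_queue(),                            %% drop the backlog
            UiPid ! {erlyberly, traces_suspended, Len},
            suspended;
        false ->
            ok
    end.

%% Drain every pending message without blocking.
flush_queue() ->
    receive _Any -> flush_queue()
    after 0 -> ok
    end.
```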
This needs to be prototyped and tested, redbug and recon_trace do not use this method.
Clean up on the remote node if the UI crashes
The collector process already has a node monitor on the erlyberly UI node and will stop tracing if it goes down, so that part is already safe.
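That existing safety net amounts to something like the following sketch; the module name and the split into a `handle/2` function are assumptions for illustration:

```erlang
%% Sketch: watch the erlyberly UI node and stop all tracing on the
%% target node if the UI node goes down.
-module(ui_node_monitor).
-export([start/1, handle/2]).

start(UiNode) ->
    spawn(fun() ->
        erlang:monitor_node(UiNode, true),
        receive Msg -> handle(Msg, UiNode) end
    end).

handle({nodedown, UiNode}, UiNode) ->
    dbg:stop_clear(),   %% remove all trace patterns and stop tracing
    stopped;
handle(_Other, _UiNode) ->
    ignored.
```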
Summary
We could do most of this just on the erlang side without touching the UI, which is good because erlang is much better at regulating load.
On the UI side we would need to add a "Safe Mode" check box on the connection window. In this mode, many features are disabled: for example, stack traces, annotating records with field names for gen_server callbacks, the process table, and xref.
It would also need a button to reapply traces if they have been suspended.
Most of the work would be testing against production systems, and we could never guarantee that it was production safe. The ideas above protect against many small messages, but could they handle a few very large messages?
There would need to be a tagged version that we were sure was "production safe" and then the master branch would be considered bleeding edge.