Overload Protection #110
Comments
What about trying to rip out some redbug functionality for safer tracing?
@ruanpienaar I was considering it, but concluded it would be too big of a change to do that and still maintain all of the features we already have.
The trace limit could also be based on time. For example, for a webserver getting thousands of requests I might want to trace for only 20 seconds. Maybe it could be set globally or per tracer?
@shezarkhani Would it still be vulnerable to overload in those 20 seconds?
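A wall-clock limit like the 20 seconds suggested above could be sketched with the standard `timer` module. The module name `trace_timer` and function `stop_after/1` are hypothetical illustrations, not existing erlyberly code:

```erlang
%% Hypothetical sketch: stop all tracing after a fixed time limit.
%% timer:apply_after/4 schedules dbg:stop_clear/0 to run once the
%% limit (in milliseconds) has elapsed.
-module(trace_timer).
-export([stop_after/1]).

stop_after(Millis) when is_integer(Millis), Millis > 0 ->
    {ok, _TRef} = timer:apply_after(Millis, dbg, stop_clear, []).
```

For the 20-second example: `trace_timer:stop_after(20000).` Note this only bounds the duration; as the reply above points out, it does nothing against a burst of traces inside the window.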
A lot of this is implemented in #112:
We still need a mode in which the process on the target node does minimal processing, e.g. no record field highlighting or stack traces. Some way of handling very large binaries or strings would also make it much safer to handle many more messages. This probably means persisting large binaries to disk and showing only the first 100 bytes in the summary; the rest would be shown in the hex viewer. Something would need to be done to search these files when filtering.
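The large-binary handling could look something like the following sketch. The module name, the directory argument, and the exact placement of the 100-byte cutoff are assumptions for illustration:

```erlang
%% Sketch: persist large binaries to disk and keep only the first
%% 100 bytes for the table summary. The hex viewer would later load
%% the full file from disk. Names here are illustrative.
-module(trace_binary).
-export([summarise/2]).

-define(SUMMARY_BYTES, 100).

summarise(Bin, Dir) when is_binary(Bin), byte_size(Bin) > ?SUMMARY_BYTES ->
    %% Write the full binary out so nothing is lost, then truncate.
    FileName = filename:join(Dir,
                   integer_to_list(erlang:unique_integer([positive]))),
    ok = file:write_file(FileName, Bin),
    {truncated, binary:part(Bin, 0, ?SUMMARY_BYTES), FileName};
summarise(Bin, _Dir) when is_binary(Bin) ->
    %% Small binaries pass through untouched.
    {full, Bin}.
```

Filtering would then need a second pass that searches the persisted files, which is the open problem mentioned above.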
erlyberly should be safe for production, or at least as safe as we can make it.
Here are some thoughts on one possible solution. Comments welcome.
The UI crashes when too many traces are sent at once
I suspect there are some wasteful methods, converting OTP terms to strings to be shown in the table. This could be improved a lot with profiling.
Need to investigate how jinterface can handle overload. May need to modify jinterface.
The UI process can OOM because it does not remove traces unless commanded by the user.
The UI should also limit the number of traces that it can hold to prevent a Java OOM, for example a maximum of 1000 trace logs, deleting the oldest when another comes in.
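The bounded buffer itself would live on the Java UI side, but the evict-oldest policy can be sketched with Erlang's `queue` module. The module name and API are illustrative; only the 1000 cap comes from the text:

```erlang
%% Sketch of a bounded trace log: cap at 1000 entries and drop the
%% oldest entry when a new trace arrives at capacity.
-module(bounded_log).
-export([new/0, add/2]).

-define(MAX_TRACES, 1000).

new() -> {0, queue:new()}.

add(Trace, {Len, Q}) when Len < ?MAX_TRACES ->
    {Len + 1, queue:in(Trace, Q)};
add(Trace, {Len, Q0}) ->
    %% At capacity: evict the oldest trace before appending the new one.
    {_, Q1} = queue:out(Q0),
    {Len, queue:in(Trace, Q1)}.
```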
The collector process's message queue could grow if it cannot handle the number of traces
We could also have a separate pid that monitors the number of reductions the collector process takes and stops dbg if it is too high. I don't know if that is useful; we should try the message queue approach first.
The easiest thing I can think of is for the collector process to check its own message queue size after it has sent a message and, if it is too large, stop tracing and flush the queue immediately. Once tracing is stopped and the queue flushed, it would need to send a special message to the UI saying that tracing has been suspended.
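A minimal sketch of that self-check, assuming a threshold of 1000 messages and a made-up `traces_suspended` message tag (neither is the actual erlyberly protocol):

```erlang
%% Sketch: after forwarding a trace, the collector inspects its own
%% message queue length; if it has fallen too far behind it stops all
%% tracing, drops the backlog and tells the UI tracing was suspended.
-module(collector_guard).
-export([check_overload/1]).

-define(MAX_QUEUE_LEN, 1000).  %% assumed threshold, not tuned

check_overload(UiPid) ->
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    case Len > ?MAX_QUEUE_LEN of
        true ->
            dbg:stop_clear(),                         %% stop all tracing
            flush_queue(),                            %% drop the backlog
            UiPid ! {erlyberly, traces_suspended, Len},
            suspended;
        false ->
            ok
    end.

%% Drain every pending message without blocking.
flush_queue() ->
    receive _Any -> flush_queue()
    after 0 -> ok
    end.
```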
This needs to be prototyped and tested, redbug and recon_trace do not use this method.
Clean up on the remote node if the UI crashes
The collector process already has a node monitor on the erlyberly UI node and will stop tracing if it goes down, so that part is already safe.
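That existing safety net amounts to something like the following sketch; the module name and the split into a `handle/2` function are assumptions for illustration:

```erlang
%% Sketch: watch the erlyberly UI node and stop all tracing on the
%% target node if the UI node goes down.
-module(ui_node_monitor).
-export([start/1, handle/2]).

start(UiNode) ->
    spawn(fun() ->
        erlang:monitor_node(UiNode, true),
        receive Msg -> handle(Msg, UiNode) end
    end).

handle({nodedown, UiNode}, UiNode) ->
    dbg:stop_clear(),   %% remove all trace patterns and stop tracing
    stopped;
handle(_Other, _UiNode) ->
    ignored.
```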
Summary
We could do most of this just on the erlang side without touching the UI, which is good because erlang is much better at regulating load.
On the UI side we would need to add a "Safe Mode" check box on the connection window. In this mode, many features are disabled: for example, stack traces, annotating records with field names for gen_server callbacks, the process table, and xref.
It would also need a button to reapply traces if they have been suspended.
Most of the work would be testing against production systems, and we could never guarantee that it was production safe. The ideas above protect against many small messages, but could they handle a few very large messages?
There would need to be a tagged version that we were sure was "production safe" and then the master branch would be considered bleeding edge.