
Backpressure Management #2410

Open
mattjohnsonpint opened this issue Jun 2, 2023 · 6 comments
Labels: Feature (new feature or request)

Comments

@mattjohnsonpint
Contributor

mattjohnsonpint commented Jun 2, 2023

Problem Statement

The internal BackgroundWorker class manages back-pressure for the SDK. When events, transactions, sessions, etc. are captured, the SentryClient adds them to a ConcurrentQueue within the background worker. The background worker then has a separate task that is responsible for reading from the queue and sending events to the ITransport implementation (typically that's the HttpTransport).

This is generally a good design. However, it suffers from the limitation that there's only one queue for all envelope types, non-prioritized, with a single limit set by SentryOptions.MaxQueueItems - defaulting to 30. If the queue is full, envelopes are dropped when captured, never being added to the queue. Therefore, it's quite possible (especially in a server environment) that a large flood of one type of envelope (such as transactions) can prevent another more important type of envelope (such as error events) from being sent to Sentry.

Solution Brainstorm

We need to support concurrency on the producer side, so the built-in PriorityQueue class is out. We need to be non-blocking on the consumer side, so BlockingCollection is also out. There is no built-in PriorityConcurrentQueue class. Thus, to implement prioritization, we will probably need to use multiple ConcurrentQueue instances in the background worker - perhaps in a dictionary where the key is the envelope type.
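To make the multiple-queue idea concrete, here's a rough sketch (not a proposed implementation - the `EnvelopeCategory` enum and all member names are made up for illustration): one `ConcurrentQueue` per category, with the worker draining queues in a fixed priority order so a flood of transactions can't starve error events.

```csharp
using System.Collections.Concurrent;

enum EnvelopeCategory { Error, Session, Transaction }

class PrioritizedQueues
{
    // Highest-priority categories first.
    private static readonly EnvelopeCategory[] Priority =
        { EnvelopeCategory.Error, EnvelopeCategory.Session, EnvelopeCategory.Transaction };

    private readonly ConcurrentDictionary<EnvelopeCategory, ConcurrentQueue<object>> _queues = new();

    public bool TryEnqueue(EnvelopeCategory category, object envelope, int maxPerCategory)
    {
        var queue = _queues.GetOrAdd(category, _ => new ConcurrentQueue<object>());

        // Approximate limit: concurrent producers can overshoot slightly,
        // but a full queue only drops envelopes of its own category.
        if (queue.Count >= maxPerCategory)
            return false;

        queue.Enqueue(envelope);
        return true;
    }

    public bool TryDequeue(out object? envelope)
    {
        // Drain in priority order: errors before sessions before transactions.
        foreach (var category in Priority)
        {
            if (_queues.TryGetValue(category, out var queue) && queue.TryDequeue(out envelope))
                return true;
        }

        envelope = null;
        return false;
    }
}
```

A strict priority order like this can starve the lowest category under sustained load, so a real design might want weighted draining instead.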

We also should have more than one option for controlling the maximum queue size (this might be separate properties, or a dictionary). Currently we just have MaxQueueItems - which is undocumented.

We should probably set higher default queue sizes for ASP.NET and ASP.NET Core. Currently, MaxQueueItems defaults to 30. We do show setting it higher in some of the sample apps, but with no explanation.
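For example, server apps can already raise the single global limit today; the per-category dictionary shown in the comment is purely hypothetical (`MaxQueueItemsPerCategory` and `EnvelopeCategory` are not real SentryOptions members):

```csharp
var options = new SentryOptions
{
    // Real, existing option: raise the single global limit for server workloads.
    MaxQueueItems = 100
};

// Hypothetical per-category shape this issue is brainstorming:
// options.MaxQueueItemsPerCategory = new Dictionary<EnvelopeCategory, int>
// {
//     [EnvelopeCategory.Error] = 100,
//     [EnvelopeCategory.Transaction] = 30
// };
```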

Whatever is decided in getsentry/team-sdks#10 should be taken into consideration also.

We potentially have a solution in this spike (although it may need tweaking):

References

@ericsampson

What about using TPL Dataflow?
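One way the Dataflow idea could look (a sketch only, not SDK code - the `Enqueue`/`ConsumeAsync` helpers and the `send` delegate are illustrative): a bounded `BufferBlock<T>` per priority, where `Post` returns `false` on a full block, giving the same non-blocking "drop on full" semantics the worker has today.

```csharp
using System.Threading.Tasks.Dataflow;
using Sentry.Protocol.Envelopes;

var errors = new BufferBlock<Envelope>(new DataflowBlockOptions { BoundedCapacity = 100 });
var transactions = new BufferBlock<Envelope>(new DataflowBlockOptions { BoundedCapacity = 30 });

// Producer side: Post is non-blocking and returns false at capacity.
bool Enqueue(BufferBlock<Envelope> queue, Envelope envelope) => queue.Post(envelope);

// Consumer side: always prefer errors over transactions.
async Task ConsumeAsync(Func<Envelope, CancellationToken, Task> send, CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        if (errors.TryReceive(out var e)) { await send(e, ct); continue; }
        if (transactions.TryReceive(out var t)) { await send(t, ct); continue; }
        await Task.Delay(10, ct); // nothing queued; back off briefly
    }
}
```

This buys bounded capacity and async consumption for free, at the cost of taking a dependency on System.Threading.Tasks.Dataflow.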

@bitsandfoxes changed the title from "Improved background worker" to "Backpressure Management" on Mar 4, 2024
@jamescrosswell
Collaborator

The Python implementation appears to use a combination of whether the queue is full and whether requests have been rate limited to adjust a downsample factor. This is then used to reduce the number of traces that get captured.
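As I read it, that mechanism could be paraphrased roughly like this in C# (class and member names are mine, not any SDK's API; the cap of 10 matches my reading of the Python code but should be verified):

```csharp
class BackpressureMonitor
{
    private const int MaxFactor = 10; // cap, per my reading of sentry-python
    private int _downsampleFactor;    // 0 = healthy

    // Effective rate is the configured rate divided by 2^factor.
    public double Apply(double tracesSampleRate) =>
        tracesSampleRate / Math.Pow(2, _downsampleFactor);

    // Called by a periodic health check.
    public void RecordHealthCheck(bool queueFull, bool rateLimited)
    {
        _downsampleFactor = (queueFull || rateLimited)
            ? Math.Min(_downsampleFactor + 1, MaxFactor)
            : 0; // healthy again: restore the configured sample rate
    }
}
```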

Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.

@bitsandfoxes are we looking to mimic what the Python SDK is doing in .NET, or are we also considering strategies like the one Matt originally proposed (some way to implement priority queues)?

@bitsandfoxes
Contributor

We do have some docs on backpressure.

Curiously, it looks like rate limits across any category would cause downsampling (not just rate limits applied to traces). It would be good to understand if that was deliberate and, if so, what the thinking was behind that design choice.

If traces get rate limited, errors get downsampled?

@jamescrosswell
Collaborator

If traces get rate limited, errors get downsampled?

No, the other way around... Downsampling only affects traces in the Python implementation. However, if Metrics or Errors get rate limited, this would cause traces to be downsampled, if I've understood that code correctly.

@bitsandfoxes
Contributor

After talking to the other teams I think we should mimic the other SDKs.
My reasoning:

  1. It's a lot easier and simpler
  2. It's already out there in the wild and proven to work well (enough)
  3. The idea is still to monitor and ease up on "unhealthy" systems. Transactions are the most likely culprit.
  4. The main target/scenario we'd like this to work really well for is spike protection.
  5. It's not so much to "protect" Relay from too many requests. But if the client is in such a bad state that it has trouble sending events (i.e. the queue is overflowing), we'd ideally no-op as much as possible so as not to add to the trauma.
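Point 5 could look something like this (illustrative only - `_backpressureMonitor`, `IsUnhealthy`, and the capture signature are hypothetical, not the actual SentryClient code): skip the capture work entirely while unhealthy, instead of building envelopes that will be dropped anyway.

```csharp
public void CaptureTransaction(Transaction transaction)
{
    // No-op early: don't serialize or enqueue while the client is struggling.
    if (_backpressureMonitor.IsUnhealthy)
        return;

    _worker.EnqueueEnvelope(Envelope.FromTransaction(transaction));
}
```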
