
feat(replays): Add buffered consumer implementation #85356

Open
wants to merge 50 commits into master

Conversation

@cmanallen (Member) commented Feb 18, 2025

TODO description...

Partially addresses the DACI: https://www.notion.so/sentry/DACI-Session-Replay-Recording-Consumer-Stability-and-Performance-Improvements-19e8b10e4b5d80a192a1ecd46f13eebb

How it works:

  • As messages come in, they are processed and their processed results are stored on a queue.
  • When the queue fills up, the processed messages are flushed.
    • Flushing involves committing data to GCS, ClickHouse, BigQuery, and DataDog.
    • Anything I/O related.
  • Flushing happens in a thread-pool.

This closely mirrors current production behavior, except that processing is no longer done in the thread-pool.
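Roughly, this is the shape of the flow (an illustrative sketch only, not the real implementation; process_message and flush_to_sinks are hypothetical placeholders):

from concurrent.futures import ThreadPoolExecutor, wait

def process_message(raw: bytes) -> dict:
    ...  # hypothetical placeholder: parse and transform the message on the main thread

def flush_to_sinks(processed: list[dict]) -> None:
    ...  # hypothetical placeholder: write to GCS, ClickHouse, BigQuery, DataDog

class BufferedConsumer:
    def __init__(self, max_buffer_length: int) -> None:
        self.buffer: list[dict] = []
        self.max_buffer_length = max_buffer_length
        self.pool = ThreadPoolExecutor(max_workers=8)

    def submit(self, raw: bytes) -> None:
        # Processing happens inline; only the processed result is buffered.
        self.buffer.append(process_message(raw))
        if len(self.buffer) >= self.max_buffer_length:
            self.flush()

    def flush(self) -> None:
        # I/O is fanned out to the thread-pool and awaited before offsets
        # would be committed.
        future = self.pool.submit(flush_to_sinks, self.buffer)
        wait([future])
        self.buffer = []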

This PR also introduces a new RunTime abstraction for managing state changes in the consumer, which I will document in a Notion doc.
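The RunTime is not shown in this thread, so the following is only a rough guess at the shape implied above: a model, messages fed through an update function, and commands returned for the platform to execute. Every name here is hypothetical rather than the PR's actual API.

from dataclasses import dataclass, field
from typing import Generic, TypeVar

T = TypeVar("T")

@dataclass
class Model(Generic[T]):
    buffer: list[T] = field(default_factory=list)
    last_flushed_at: float = 0.0

# Hypothetical commands the application returns instead of performing I/O itself.
@dataclass
class Flush:
    items: list

@dataclass
class Nothing:
    pass

def update(model: Model[T], msg: T, max_buffer_length: int = 10) -> tuple[Model[T], Flush | Nothing]:
    # Pure state transition: buffer the message and decide whether to flush.
    # The RunTime interprets the returned command, performs the I/O, and
    # commits offsets; the application never touches Kafka directly.
    model.buffer.append(msg)
    if len(model.buffer) >= max_buffer_length:
        items, model.buffer = model.buffer, []
        return model, Flush(items=items)
    return model, Nothing()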

@github-actions bot added the Scope: Backend label (automatically applied to PRs that change backend components) Feb 18, 2025
codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 97.70115% with 10 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines                                Patch %   Lines
src/sentry/replays/consumers/buffered/platform.py       95.72%    5 Missing ⚠️
...ests/sentry/replays/unit/consumers/test_helpers.py   88.00%    3 Missing ⚠️
src/sentry/replays/consumers/buffered/consumer.py       98.79%    1 Missing ⚠️
...ests/sentry/replays/unit/consumers/test_runtime.py   98.00%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #85356    +/-   ##
========================================
  Coverage   87.88%   87.88%            
========================================
  Files        9714     9720     +6     
  Lines      550639   551075   +436     
  Branches    21449    21449            
========================================
+ Hits       483917   484334   +417     
- Misses      66342    66361    +19     
  Partials      380      380            

@cmanallen cmanallen marked this pull request as ready for review February 28, 2025 16:58
@cmanallen cmanallen requested review from a team as code owners February 28, 2025 16:58
@fpacifici (Contributor) left a comment

I would advise against building a consumer with this abstraction and making it work on top of Arroyo.

  • Arroyo provides an abstraction level that is similar to the one provided by your PlatformStrategy. The two, though, go in fundamentally different directions. Building one on top of the other seems to create a quite complex architecture. It is quite hard to intuitively tell whether the system guarantees at-least-once delivery.
  • We will not be able to move parts of the processing onto a multi-process pool if it is needed for scale, as that would require using the RunTask abstraction.
  • As we discussed in the past, I believe the dataflow model (even the small subset provided by Arroyo) makes this kind of application simpler to understand due to the sequential nature of the pipeline.

Comment on lines 27 to 32
def create_with_partitions(
    self,
    commit: ArroyoCommit,
    partitions: Mapping[Partition, int],
) -> ProcessingStrategy[KafkaPayload]:
    return PlatformStrategy(commit=commit, flags=self.flags, runtime=recording_runtime)

Wouldn't it be considerably simpler to model this consumer as a sequence of these Arroyo operators:

Modeling the system this way would:

  • allow parallelism via either processes or threads without application-logic changes
  • guarantee a pipeline approach that allows the batching step to keep batching new messages while the worker thread performs its work.
  • hide offset management entirely from the application code.
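The specific operators originally listed in this comment are not preserved above; as a rough, hypothetical sketch of the kind of composition being described (constructor signatures may differ between Arroyo versions, and parse_message, append_to_batch, and flush_batch are placeholders):

from arroyo.processing.strategies import CommitOffsets, Reduce, RunTask, RunTaskInThreads

def parse_message(message):
    # Placeholder: CPU-bound processing of the raw Kafka message.
    return message.payload

def append_to_batch(batch, message):
    # Placeholder accumulator used by the Reduce (batching) step.
    batch.append(message.payload)
    return batch

def flush_batch(message):
    # Placeholder: I/O-bound flush performed on a worker thread.
    return message

def create_with_partitions(commit, partitions):
    # process -> batch -> flush in worker threads -> commit offsets
    return RunTask(
        parse_message,
        Reduce(
            max_batch_size=1000,
            max_batch_time=1.0,
            accumulator=append_to_batch,
            initial_value=list,
            next_step=RunTaskInThreads(
                flush_batch,
                concurrency=8,
                max_pending_futures=16,
                next_step=CommitOffsets(commit),
            ),
        ),
    )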

Comment on lines 57 to 63
def can_flush(self, model: Model[ProcessedRecordingMessage]) -> bool:
    # TODO: time.time is stateful and hard to test. We should enable the RunTime to perform
    # managed effects so we can properly test this behavior.
    return (
        len(model.buffer) >= self.__max_buffer_length
        or (time.time() - self.__max_buffer_wait) >= self.__last_flushed_at
    )

Arroyo primitives manage these kinds of concerns for you (when to flush a batch, for example). Are you sure about the idea of pushing them into the product code instead?

@cmanallen (Member, Author)

@fpacifici Thanks for the feedback! Let me preface by saying when I opened this PR I did so for three reasons:

  1. To demonstrate the behaviors I outlined in my streaming platform proposal so you would have something tangible to reference.
  2. To explore new ideas as part of a research project. The goal is exploration so using Arroyo primitives would have gone against my goals.
  3. I'm trying to solve a real problem from a pre-determined starting point with constraints on what can change and how fast.

So there are practical concerns intermixed with educational and exploratory concerns. I've since dropped the educational component and simplified what I'm doing in the PR with this morning's commits (it mostly doesn't impact your review -- but it did simplify the consumer implementation so it might be worth a second look). Hopefully that explains why I made some of the choices I did.

Concerns Around Committing

I agree completely. Offsets were a pain to manage, and it was never necessary to manage them in application code (for my use case, at least). I moved the offset handling into the RunTime, where they're now managed at the platform level.

Why Not Use Arroyo

It's a research project, so we're trying something new. But, critically, this is Arroyo. It's not a full implementation, but it is an Arroyo strategy. That means I can prefix the step with any number of streaming primitives and I could suffix the step with any number of streaming primitives (could being the operative word, because I'm currently hard-coding the commit step as the next step -- that's an easy fix, so I'm ignoring this oversight).

This RunTime strategy is a generalization of all Arroyo strategies, so it's not surprising that specialized strategies exist that can solve components of this pipeline. Research aside, I do want this to go to production, and I'll mention why at the bottom.


So the three concerns I'm keying in on are:

  1. allow parallelism via either processes or threads without application-logic changes
  2. guarantee a pipeline approach that allows the batching step to keep batching new messages while the worker thread performs its work.
  3. hide offset management entirely from the application code.

Three is gone; I've removed it. One is partially solved, I think. We can prefix the RunTime step with a multi-threading/processing step. I'm not sure if the RunTime can be embedded into those steps, so that may be a shortcoming. That's an area I could look into if this were ever an important component of Arroyo.

Two is not solved, as far as I'm aware. I wrote the Buffer strategy in Arroyo, and the Reduce strategy implements the Buffer strategy. I'm not aware of those strategies flushing their buffers in a worker thread; as far as I know, they block the main thread. But if that's not the case and there is some platform magic happening, then I don't see why the RunTime strategy couldn't also have the benefit of flushing off the main thread.


I want this to go to production; why? tl;dr: I can now unit-test my consumer end-to-end.

Testing is a huge concern for me. I've refactored this consumer in the past and it's led to production outages (using Arroyo streaming primitives, as it happens -- which I don't blame).

There are minor things that Arroyo does that can make testing more difficult. For example, in the Reduce strategy we call time.time(), which can make unit-testing difficult: you have to mock it or simply not test certain behaviors, which is not ideal. However, there's a larger problem I'm trying to address, and that's the difficulty of testing how state is threaded through a consumer. Arroyo does not provide any facilities for this, and my PR did not have any until this morning.

One of the benefits of managing the state machine in the way I have is that I can intercept the commands being issued by my application and rewrite them. You can see in the MockRunTime class that I'm using coroutines to rewrite commands in my test suite. This gives me a lot of insight into what the application is doing and the ability to redirect behavior in a way that does not require monkey-patching. I can deterministically simulate all possible states and assert the outcome very cheaply.
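The MockRunTime class is not reproduced in this thread. Purely as an illustration of the coroutine-interception idea (every name below is hypothetical), a generator-based driver can yield each command the application issues so the test can inspect it and send back a substitute result:

def drive(runtime, messages):
    # Hypothetical test driver: run the application's update loop, but yield
    # every command to the test instead of executing it.
    model = runtime.initial_model()
    for message in messages:
        model, cmd = runtime.update(model, message)
        result = yield cmd  # the test inspects cmd and replies via gen.send(...)
        model = runtime.handle_result(model, result)

# In a test (sketch):
#   gen = drive(mock_runtime, [msg1, msg2])
#   cmd = next(gen)                   # first command issued by the application
#   assert isinstance(cmd, Flush)     # assert on intent, not on side effects
#   cmd = gen.send(FlushSucceeded())  # deterministically rewrite the outcome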

My implementation isn't perfect (I haven't abstracted all state yet) but there's already been a significant uplift in what I'm capable of asserting about my software.

There are other reasons but this has already gotten too long so I'll leave it there. Let me know if I addressed your concerns well enough and thanks again for taking the time to review this!
