Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threaded telemetry #436

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Threaded telemetry #436

wants to merge 17 commits into from

Conversation

AdrianSosic
Copy link
Collaborator

This PR improves the telemetry mechanism such that all telemetry-related actions now run in an isolated daemon thread.

Main problem addressed

In the previous version, telemetry actions ran in the main thread, blocking the execution of the actual BayBE code. This was especially noticeable in situations where the endpoint is unreachable (generally for external users or when code was executed without internet connection, e.g. me sitting on the train), resulting in longer turnaround times and often giving the impression that "nothing happens".

Solution

In the new version, the main thread interacts with telemetry only via an event queue, giving an entirely frictionless experience.

Other improvements

  • The monolithic telemetry script has been refactored into easy-to-digest pieces. Furthermore, the separation into threads results an overall cleaner architecture, which also offers more flexibility. For example, things could now be easily extended to disable telemetry only temporarily, if needed.
  • Telemetry packages are now loaded lazily, avoiding imports altogether if connection is not possible.
  • Telemetry functionality has been split into "public" and "private" parts, providing a clean API that can be used inside BayBE code (while keeping the aforementioned benefit of lazy imports).

@AdrianSosic AdrianSosic added the enhancement Expand / change existing functionality label Nov 27, 2024
@AdrianSosic AdrianSosic self-assigned this Nov 27, 2024
@AdrianSosic
Copy link
Collaborator Author

@AVHopp @Scienfitz: Here a little quality-of-life PR solving a problem that kept annoying me since quite a while. Note: I still need to run a final test once the telemetry sync is back up. But we can already prepare the PR for merge in the meantime.

@AdrianSosic AdrianSosic force-pushed the feature/threaded_telemetry branch from b4b741b to 201b52c Compare November 27, 2024 10:21
@Scienfitz
Copy link
Collaborator

before I review this: can you elaborate how telemetry so far could be blocking? As far as I understand all requests handled by the telemetry dependencies are already done asynchonously and even in the scenario described (going from a zone of previously established connectivity to a no-connectivity zone) had onlye ver resulted in prints, never blocking of code execution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turning your comment into a thread:

TBH, it's hard for me to exactly explain the problem. All I can say: I can't reproduce it at home, but I regularly encounter the problem on the train, probably during the moments where the mobile connection is slow / lost. So my mac still "sees" a connection, but can't reach then endpoint. And the process is always the same:

  • I'm working on the code
  • Suddenly, stuff does not execute anymore
  • I think I broke something and start debugging
  • A few mins later, I realize: connection problems.
  • I simply disable WIFI and execution works again.

So the problem is not telemetry itself but some of our code parts that are blocking. I can try to reproduce and figure out next time I'm on the train and run into the problem.

But regardless of whether we keep the threading approach or not, we can still leverage some of the improvements of this PR, i.e. deletion of dead code, lazy telemetry loading, cleanup into small functions, ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not being able to reproduce and not knowing the exact source of the issue, were you already able to verify the new solution fixes your issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think it cannot "not" fix the issue since there is no nothing left that could be blocking. At least in principle, the entire execution of the original telemetry.py was in the main thread, so whatever happened there would block the further execution of the baybe code (not talking about the actual telemetry calls but the pure import of the module). This is now all gone.

But we can delay the discussion until I have more evidence, if you prefer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dont like this approach coding auf sicht

if you know the issue we can merge without test, based on clearly seeing the cause being fixed in the code

if we dont know the code its not even clear any of the code in this PR addresses anything relevant

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo, this should be discussed in our meeting instead of here.

@Scienfitz
Copy link
Collaborator

@AdrianSosic just as reminder, we need to perform a minimal check whether the changed implementation still submits telemetry as expected. We can do this together in a short meeting where I check out the factory while you trigger telemetry simulation
If that works then Im happy for review and merge of this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Expand / change existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants