-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threaded telemetry #436
base: main
Are you sure you want to change the base?
Threaded telemetry #436
Conversation
@AVHopp @Scienfitz: Here a little quality-of-life PR solving a problem that kept annoying me since quite a while. Note: I still need to run a final test once the telemetry sync is back up. But we can already prepare the PR for merge in the meantime. |
b4b741b
to
201b52c
Compare
before I review this: can you elaborate how telemetry so far could be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turning your comment into a thread:
TBH, it's hard for me to exactly explain the problem. All I can say: I can't reproduce it at home, but I regularly encounter the problem on the train, probably during the moments where the mobile connection is slow / lost. So my mac still "sees" a connection, but can't reach then endpoint. And the process is always the same:
- I'm working on the code
- Suddenly, stuff does not execute anymore
- I think I broke something and start debugging
- A few mins later, I realize: connection problems.
- I simply disable WIFI and execution works again.
So the problem is not telemetry itself but some of our code parts that are blocking. I can try to reproduce and figure out next time I'm on the train and run into the problem.
But regardless of whether we keep the threading approach or not, we can still leverage some of the improvements of this PR, i.e. deletion of dead code, lazy telemetry loading, cleanup into small functions, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not being able to reproduce and not knowing the exact source of the issue, were you already able to verify the new solution fixes your issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think it cannot "not" fix the issue since there is no nothing left that could be blocking. At least in principle, the entire execution of the original telemetry.py
was in the main thread, so whatever happened there would block the further execution of the baybe code (not talking about the actual telemetry calls but the pure import of the module). This is now all gone.
But we can delay the discussion until I have more evidence, if you prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dont like this approach coding auf sicht
if you know the issue we can merge without test, based on clearly seeing the cause being fixed in the code
if we dont know the code its not even clear any of the code in this PR addresses anything relevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo, this should be discussed in our meeting instead of here.
@AdrianSosic just as reminder, we need to perform a minimal check whether the changed implementation still submits telemetry as expected. We can do this together in a short meeting where I check out the factory while you trigger telemetry simulation |
This PR improves the telemetry mechanism such that all telemetry-related actions now run in an isolated daemon thread.
Main problem addressed
In the previous version, telemetry actions ran in the main thread, blocking the execution of the actual BayBE code. This was especially noticeable in situations where the endpoint is unreachable (generally for external users or when code was executed without internet connection, e.g. me sitting on the train), resulting in longer turnaround times and often giving the impression that "nothing happens".
Solution
In the new version, the main thread interacts with telemetry only via an event queue, giving an entirely frictionless experience.
Other improvements