Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dev]: Stale gathering detection #165

Open
TylerBloom opened this issue Sep 25, 2023 · 1 comment · May be fixed by #175
Open

[Dev]: Stale gathering detection #165

TylerBloom opened this issue Sep 25, 2023 · 1 comment · May be fixed by #175
Assignees
Labels
in progess Actively being worked on requirement A requirement for the next major release SquireCore Affects the SquireCore server SquireSDK Affects the SquireSDK library todo Will be resolved but work hasn't started

Comments

@TylerBloom
Copy link
Member

Unmet Need:

Currently, a Gathering will live forever after it is spawned. This is not ideal as it will cause memory exhaustion.

Solution:

The Gathering actor is the only component in the system that can "know" if it should be removed. There are two primary times when a gathering should be removed: when it has not processed a message in a while (~a day) or when all of its outbound connections have been closed. The second case is a bit harder to detect and would eventually lead to the first case, so that is where we will focus.

Proper implementation of detection and removal of a gathering will require three steps: bookkeeping of the number of messages a gathering processes, communication between the gathering and the gathering hall, and a message being sent to all clients that the WebSocket connection is being closed.

The bookkeeping step should focus on precision. If a gathering does not process a message for 24 hours, it should be removed as close to that point in time as possible. In other words, the gathering should not have a simple check that runs at midnight to do bookkeeping. However, bookkeeping should not be too costly, such as queueing a termination message immediately after it processes any other message.

The second step will require some slight reworking of how a gathering communicates with the gathering hall. Once a gathering has determined that it should be dispersed, it needs to communicate this with the gathering hall. This should trigger the gathering to (somehow) be dropped. How to achieve this is somewhat unclear as the current actor model assumes that the actor will run forever. Some design work on the actor model will be needed here.

The last step is mostly a courtesy. We could simply drop the websocket connections and let the SquireClient figure things out. This is less than ideal since a connection could be terminated for any number of reasons. So, we should explicitly communicate that the connection is being closed. However, the server has a mechanism to retry messages that fail to send over websockets. The termination message does not need to be retried.

Challenges/Considerations:

The biggest list of considerations is in the second step. Because it is assumed that an actor never dies, there are unwraps in several places. We need to ensure that we are not unwrapping things going into or coming out of the gatherings.

@TylerBloom TylerBloom added todo Will be resolved but work hasn't started SquireCore Affects the SquireCore server requirement A requirement for the next major release SquireSDK Affects the SquireSDK library labels Sep 25, 2023
@akbulutdora
Copy link
Contributor

Your planning seems well-thought and I am up for it. I plan to complete each step as a separate PR, and go for the first step now. Here is an initial implementation I am thinking about:

  1. The Gathering actor will keep a last_message_received_at timestamp.j
  2. After each message the Gathering receives, it will update last_message_received_at.
  3. The actor will check the time diff in certain intervals if it's been 24 hours since the last message. It will either schedule the next check or start taking action for the termination process. The checks can be scheduled in multiple ways. One I could think of is whenever a check occurs at TNOW, the next one will be scheduled at 24 hours after TNOW - (TNOW - last_message_received_at).
  4. The second step should begin.

I don't know if the current actor implementation can accommodate such a scheduling right now, I will read through. A simpler option could be just scheduling a check every 4 hours.

@akbulutdora akbulutdora added the in progess Actively being worked on label Oct 8, 2023
@akbulutdora akbulutdora linked a pull request Oct 15, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progess Actively being worked on requirement A requirement for the next major release SquireCore Affects the SquireCore server SquireSDK Affects the SquireSDK library todo Will be resolved but work hasn't started
Projects
Development

Successfully merging a pull request may close this issue.

2 participants