[Dev]: Stale gathering detection #165

TylerBloom · 2023-09-25T20:13:48Z

Unmet Need:

Currently, a Gathering lives forever once it is spawned. This is not ideal, as the accumulated gatherings will eventually exhaust memory.

Solution:

The Gathering actor is the only component in the system that can "know" if it should be removed. There are two primary times when a gathering should be removed: when it has not processed a message in a while (~a day) or when all of its outbound connections have been closed. The second case is a bit harder to detect and would eventually lead to the first case, so that is where we will focus.

Proper implementation of detection and removal of a gathering will require three steps: bookkeeping of the number of messages a gathering processes, communication between the gathering and the gathering hall, and a message being sent to all clients that the WebSocket connection is being closed.

The bookkeeping step should focus on precision. If a gathering does not process a message for 24 hours, it should be removed as close to that point in time as possible. In other words, the gathering should not rely on a simple check that runs at midnight. However, bookkeeping should not be too costly either. For example, the gathering should not queue a termination message immediately after processing every other message.

The second step will require some slight reworking of how a gathering communicates with the gathering hall. Once a gathering has determined that it should be dispersed, it needs to communicate this with the gathering hall. This should trigger the gathering to (somehow) be dropped. How to achieve this is somewhat unclear as the current actor model assumes that the actor will run forever. Some design work on the actor model will be needed here.
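One possible shape for this hand-off, as a rough sketch rather than a design decision: the gathering sends a "dispersed" notice to the gathering hall, and the hall drops its handle to that gathering. Everything named here (`HallMessage`, `GatheringMessage`, `GatheringHall`, `process_pending`, the `u64` id) is hypothetical, and standard-library channels stand in for whatever the real actor model uses.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical messages a gathering can send to the gathering hall.
enum HallMessage {
    /// The gathering with this id has decided it should be dispersed.
    Dispersed(u64),
}

/// Hypothetical messages the hall forwards to a gathering.
#[allow(dead_code)]
enum GatheringMessage {
    Update(String),
}

/// Hypothetical hall state: one sender handle per live gathering.
struct GatheringHall {
    gatherings: HashMap<u64, Sender<GatheringMessage>>,
    inbox: Receiver<HallMessage>,
}

impl GatheringHall {
    /// Drains pending notices and drops the handle of every dispersed
    /// gathering. Returns how many gatherings were removed.
    fn process_pending(&mut self) -> usize {
        let mut removed = 0;
        while let Ok(msg) = self.inbox.try_recv() {
            match msg {
                HallMessage::Dispersed(id) => {
                    // Dropping the sender closes the gathering's inbox,
                    // which lets its receive loop end on its own.
                    if self.gatherings.remove(&id).is_some() {
                        removed += 1;
                    }
                }
            }
        }
        removed
    }
}
```

In this sketch, dropping the last sender is what lets the gathering itself be dropped: its `recv` call starts returning an error, so the actor loop can exit instead of running forever.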

The last step is mostly a courtesy. We could simply drop the WebSocket connections and let the SquireClient figure things out. This is less than ideal, since a connection could be terminated for any number of reasons and the client could not tell them apart. So, we should explicitly communicate that the connection is being closed. Note that the server has a mechanism to retry messages that fail to send over WebSockets. The termination message does not need to be retried.
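To illustrate the "no retry for the termination message" rule, here is a minimal sketch. It does not reflect the server's actual retry mechanism: `ServerMessage`, `should_retry`, and `send_with_policy` are hypothetical names, and the transport is abstracted as a closure that reports whether a send succeeded.

```rust
/// Hypothetical messages the server sends to clients.
#[derive(Debug, Clone, PartialEq)]
enum ServerMessage {
    /// An ordinary message that is worth retrying on failure.
    Update(String),
    /// Courtesy notice that the connection is about to be closed.
    Closing { reason: String },
}

impl ServerMessage {
    /// Everything except the closing notice is retried.
    fn should_retry(&self) -> bool {
        !matches!(self, ServerMessage::Closing { .. })
    }
}

/// Tries to send `msg`, retrying up to `max_attempts` times when the
/// message allows it. Returns the number of attempts that were made.
fn send_with_policy<F>(msg: &ServerMessage, max_attempts: u32, mut try_send: F) -> u32
where
    F: FnMut(&ServerMessage) -> bool,
{
    // A closing notice always gets exactly one attempt.
    let allowed = if msg.should_retry() { max_attempts.max(1) } else { 1 };
    let mut attempts = 0;
    while attempts < allowed {
        attempts += 1;
        if try_send(msg) {
            break;
        }
    }
    attempts
}
```

Putting the policy on the message type keeps the retry loop generic, so the closing notice needs no special case at each call site.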

Challenges/Considerations:

Most of the considerations concern the second step. Because it is assumed that an actor never dies, there are unwraps in several places. We need to ensure that we are not unwrapping values going into or coming out of the gatherings.
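As a sketch of what "not unwrapping" could look like at the boundary, the function below forwards a message to a gathering and returns an error instead of panicking when the gathering has already gone away. The names (`ForwardError`, `forward`) and the use of `String` messages over a standard-library channel are assumptions made for illustration only.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

/// Hypothetical reasons a message could not be delivered.
#[derive(Debug, PartialEq)]
enum ForwardError {
    /// No gathering is registered under this id.
    UnknownGathering,
    /// The gathering was registered but has since shut down.
    GatheringGone,
}

/// Forwards `msg` to the gathering `id` without unwrapping the send.
fn forward(
    gatherings: &mut HashMap<u64, Sender<String>>,
    id: u64,
    msg: String,
) -> Result<(), ForwardError> {
    let handle = gatherings.get(&id).ok_or(ForwardError::UnknownGathering)?;
    // `send` fails only when the receiving side has been dropped,
    // i.e. when the gathering no longer exists.
    if handle.send(msg).is_err() {
        // Clean up the stale handle so later calls fail fast.
        gatherings.remove(&id);
        return Err(ForwardError::GatheringGone);
    }
    Ok(())
}
```

Callers can then decide how to surface the error (for example, by telling the client that the gathering has ended) rather than taking down the task that tried to send.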

akbulutdora · 2023-09-26T08:29:17Z

Your plan seems well thought out, and I am up for it. I plan to complete each step as a separate PR and will start with the first step now. Here is the initial implementation I have in mind:

The Gathering actor will keep a last_message_received_at timestamp.
After each message the Gathering receives, it will update last_message_received_at.
At certain intervals, the actor will check whether 24 hours have passed since the last message. It will then either schedule the next check or begin the termination process. The checks can be scheduled in multiple ways. One I could think of: whenever a check occurs at TNOW, the next one is scheduled 24 hours after TNOW - (TNOW - last_message_received_at), which is simply 24 hours after last_message_received_at.
Once a check finds that the gathering is stale, the second step should begin.
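The steps above could be sketched roughly as follows. This is only an illustration under assumptions: `Bookkeeping`, `Check`, and `STALE_AFTER` are made-up names, and the current time is passed in explicitly so the logic does not depend on any particular scheduler or async runtime.

```rust
use std::time::{Duration, Instant};

/// Hypothetical idle limit; the issue suggests roughly one day.
const STALE_AFTER: Duration = Duration::from_secs(24 * 60 * 60);

/// Outcome of a staleness check.
#[derive(Debug, PartialEq)]
enum Check {
    /// The gathering has been idle long enough to start termination.
    Stale,
    /// The gathering is still live; check again after this delay.
    RecheckIn(Duration),
}

/// Hypothetical bookkeeping state kept by the Gathering actor.
struct Bookkeeping {
    last_message_received_at: Instant,
}

impl Bookkeeping {
    fn new(now: Instant) -> Self {
        Self { last_message_received_at: now }
    }

    /// Called after every processed message; only a timestamp write.
    fn record_message(&mut self, now: Instant) {
        self.last_message_received_at = now;
    }

    /// Decides whether to terminate or when to look again. The next
    /// check lands exactly 24 hours after the last message.
    fn check(&self, now: Instant) -> Check {
        let idle = now.saturating_duration_since(self.last_message_received_at);
        if idle >= STALE_AFTER {
            Check::Stale
        } else {
            Check::RecheckIn(STALE_AFTER - idle)
        }
    }
}
```

With this shape, the per-message cost is a single timestamp update, and at most one check is pending at a time, which seems to fit both the precision and the cost concerns raised in the issue.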

I don't know yet whether the current actor implementation can accommodate this kind of scheduling, so I will read through it. A simpler option would be to schedule a check every 4 hours.

TylerBloom added the todo, SquireCore, requirement, and SquireSDK labels Sep 25, 2023

TylerBloom assigned akbulutdora Sep 25, 2023

TylerBloom added this to Squire Tournament Services Sep 25, 2023

akbulutdora added the in progess label Oct 8, 2023

akbulutdora linked a pull request Oct 15, 2023 that will close this issue

165 dev stale gathering detection #175

Open
