[Dev]: Stale gathering detection #165
Labels
in progress
Actively being worked on
requirement
A requirement for the next major release
SquireCore
Affects the SquireCore server
SquireSDK
Affects the SquireSDK library
todo
Will be resolved but work hasn't started
Unmet Need:
Currently, a Gathering will live forever after it is spawned. This is not ideal as it will cause memory exhaustion.
Solution:
The Gathering actor is the only component in the system that can "know" if it should be removed. There are two primary times when a gathering should be removed: when it has not processed a message in a while (about a day) or when all of its outbound connections have been closed. The second case is a bit harder to detect and would eventually lead to the first case anyway, so the first case is where we will focus.
Proper implementation of detection and removal of a gathering will require three steps: bookkeeping of the number of messages a gathering processes, communication between the gathering and the gathering hall, and a message being sent to all clients that the WebSocket connection is being closed.
The bookkeeping step should focus on precision. If a gathering does not process a message for 24 hours, it should be removed as close to that point in time as possible. In other words, the gathering should not rely on a simple check that runs at midnight to do bookkeeping. At the same time, bookkeeping should not be too costly. For example, the gathering should not queue a new termination message every time it processes another message.
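One possible shape for this bookkeeping, as a rough sketch rather than a decided design: the gathering records the time of its last message and keeps at most one pending timer. When the timer fires, the gathering either declares itself stale or re-arms the timer for the remaining time. `IdleTracker`, `STALE_AFTER`, and the method names are hypothetical and do not exist in SquireCore today. The timer itself (e.g. an async sleep) is left out.

```rust
use std::time::{Duration, Instant};

/// Idle limit taken from the issue text (24 hours). The constant name is hypothetical.
const STALE_AFTER: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical bookkeeping state owned by a single gathering actor.
struct IdleTracker {
    last_message: Instant,
    timeout: Duration,
}

impl IdleTracker {
    fn new(now: Instant, timeout: Duration) -> Self {
        Self { last_message: now, timeout }
    }

    /// Called once per processed message. This is a single store,
    /// so nothing is queued on the hot path.
    fn record_message(&mut self, now: Instant) {
        self.last_message = now;
    }

    /// The instant at which the gathering becomes stale.
    fn deadline(&self) -> Instant {
        self.last_message + self.timeout
    }

    /// Called when the single pending timer fires.
    /// Returns `None` if the gathering is stale and should disperse, or
    /// `Some(remaining)` if a message arrived in the meantime and the
    /// timer should be re-armed for `remaining`.
    fn on_timer(&self, now: Instant) -> Option<Duration> {
        let deadline = self.deadline();
        if now >= deadline {
            None
        } else {
            Some(deadline - now)
        }
    }
}
```

Because the timer is always re-armed for exactly the remaining time, removal happens at the 24 hour mark after the last message, and there is never more than one timer outstanding per gathering.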
The second step will require some slight reworking of how a gathering communicates with the gathering hall. Once a gathering has determined that it should be dispersed, it needs to communicate this to the gathering hall. This should (somehow) cause the gathering to be dropped. How to achieve this is somewhat unclear, as the current actor model assumes that an actor will run forever. Some design work on the actor model will be needed here.
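To make the second step more concrete, here is a minimal sketch of one option: the gathering sends a "dispersed" notification to the gathering hall, and the hall drops its handle to that gathering. This uses `std::sync::mpsc` only to stay self-contained; the real actor model may use different channels. `GatheringId`, `HallMessage`, `GatheringMessage`, `announce_dispersal`, and `process_notifications` are all hypothetical names, not existing SquireCore APIs.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};

/// Placeholder identifier type (assumption; the real id type may differ).
type GatheringId = u64;

/// Stand-in for whatever messages the hall forwards to a gathering.
struct GatheringMessage;

/// Messages a gathering can send back to the hall (hypothetical).
enum HallMessage {
    Dispersed(GatheringId),
}

/// Hypothetical, stripped-down gathering hall.
struct GatheringHall {
    gatherings: HashMap<GatheringId, Sender<GatheringMessage>>,
    inbox: Receiver<HallMessage>,
}

impl GatheringHall {
    /// Drains pending notifications and drops the handle of every
    /// gathering that announced its dispersal. Returns how many
    /// gatherings were removed.
    fn process_notifications(&mut self) -> usize {
        let mut removed = 0;
        while let Ok(HallMessage::Dispersed(id)) = self.inbox.try_recv() {
            if self.gatherings.remove(&id).is_some() {
                removed += 1;
            }
        }
        removed
    }
}

/// Gathering side: tell the hall we are done. Returns `false` instead of
/// panicking if the hall is no longer listening.
fn announce_dispersal(id: GatheringId, hall: &Sender<HallMessage>) -> bool {
    hall.send(HallMessage::Dispersed(id)).is_ok()
}
```

After announcing, the gathering's run loop would simply return, which drops its state. The hall only has to forget the handle.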
The last step is mostly a courtesy. We could simply drop the WebSocket connections and let the SquireClient figure things out. This is less than ideal since a connection could be terminated for any number of reasons, and the client would not be able to tell why. So, we should explicitly communicate that the connection is being closed. Note that the server has a mechanism to retry messages that fail to send over WebSockets; the termination message does not need to be retried.
Challenges/Considerations:
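One consideration from the last step of the solution is that the closing notice should bypass the retry mechanism. A minimal sketch of how that could look, assuming the retry logic can inspect the outgoing message: `ServerFrame`, `SendFailed`, `should_retry`, and `send_with_policy` are hypothetical names, and the `send` closure stands in for the actual WebSocket send.

```rust
/// Hypothetical outbound messages from the server to a client.
#[derive(Debug, Clone, PartialEq)]
enum ServerFrame {
    /// Stand-in for normal traffic that should be retried on failure.
    Update(String),
    /// Courtesy notice that the gathering is dispersing.
    Closing { reason: String },
}

/// Placeholder error for a failed send.
#[derive(Debug)]
struct SendFailed;

impl ServerFrame {
    /// The closing notice is best-effort, so it is never retried.
    fn should_retry(&self) -> bool {
        !matches!(self, ServerFrame::Closing { .. })
    }
}

/// Tries to send `frame`, retrying up to `max_attempts` times unless the
/// frame opts out of retries. Returns whether the send succeeded and how
/// many attempts were made.
fn send_with_policy<F>(frame: &ServerFrame, max_attempts: u32, mut send: F) -> (bool, u32)
where
    F: FnMut(&ServerFrame) -> Result<(), SendFailed>,
{
    let allowed = if frame.should_retry() { max_attempts.max(1) } else { 1 };
    for attempt in 1..=allowed {
        if send(frame).is_ok() {
            return (true, attempt);
        }
    }
    (false, allowed)
}
```

With this shape, a failed `Closing` frame is attempted once and then dropped, while other frames keep their existing retry behavior.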
Most of the considerations concern the second step. Because it is assumed that an actor never dies, there are unwraps in several places. We need to ensure that we are not unwrapping anything that is sent into or received from a gathering, since either side of that exchange can now fail once the gathering has been dropped.
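As an illustration of the kind of change this implies, here is a sketch that turns a failed send into an error value instead of unwrapping. It uses `std::sync::mpsc` and a `String` payload purely as stand-ins; `ForwardError` and `forward` are hypothetical and the real channel and message types will differ.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical error returned when a gathering can no longer be reached.
#[derive(Debug, PartialEq)]
enum ForwardError {
    GatheringGone,
}

/// Forwards a message to a gathering. If the gathering has been dropped,
/// its receiver is gone and `send` returns an error, which is mapped to
/// `ForwardError::GatheringGone` rather than unwrapped.
fn forward(handle: &Sender<String>, msg: String) -> Result<(), ForwardError> {
    handle.send(msg).map_err(|_| ForwardError::GatheringGone)
}
```

The caller can then decide what to do with `GatheringGone`, for example removing the stale handle or reporting to the client that the gathering no longer exists.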