The original RCP-027 does not state that eventId>0 queries must give you the entire set of events for an object since it began. It is completely silent on this. If we are proposing that all server vendors MUST support this, this should be a different RCP that suggests this behaviour as many vendors will not be able to support this easily. Yes, there are many ways to do this, some easier than others, but I feel forcing this as a requirement is going to burden some server vendors with much additional work that they don't have on their roadmaps.
Summary
One of the topics at the RESO 2023 Summer Developers Workshop was how to initialize systems using the EntityEvent Resource (log).
There were some questions around sliding windows and what it might mean if the log were to become large.
Initialization
Initialization was originally discussed in the Replication Subgroup (Transport), which is where the RCP-027 and RCP-028 proposals came from.
The thinking at the time was that data consumers needed a simple and reliable way to initialize new systems or reinitialize existing ones from a given feed, in addition to keeping them in sync. Reading events from the log at their own pace and picking up the corresponding records gives them a way to do so in a predictable manner.
Asking for events with sequence numbers greater than or equal to 0 always works for initialization, and asking for those greater than the last item the consumer has caught up with always works for syncing.
Requesting a given sequence number in the log should always return the next item in the sequence the consumer has access to if the item they requested is either not present or not visible to them. For example, if they ask for events greater than 41 and 42 isn't visible to them but 43 is, then 43 would be returned.
Initialization
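A minimal sketch of what an initialization request might look like, assuming an OData-style EntityEvent endpoint at a hypothetical base URL; the EntityEventSequence field comes from RCP-027, while the URL itself is illustrative:

```python
import requests  # third-party HTTP client

# Hypothetical RESO Web API base URL; EntityEvent and EntityEventSequence
# are the resource and field names from RCP-027.
BASE_URL = "https://api.example.com/reso/odata"

# Initialization: ask for every event with a sequence number >= 0, ordered
# so the log can be read from the beginning at the consumer's own pace.
params = {
    "$filter": "EntityEventSequence ge 0",
    "$orderby": "EntityEventSequence asc",
}
events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]
```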
Syncing
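The syncing case only changes the filter, picking up from the last sequence number the consumer processed (again a sketch, using the same hypothetical endpoint):

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint
last_sequence = 41  # the last event this consumer caught up with

# Syncing: ask only for events after the last sequence number processed.
# Per the behavior described above, if event 42 isn't visible to this
# consumer but 43 is, the response would begin at 43.
params = {
    "$filter": f"EntityEventSequence gt {last_sequence}",
    "$orderby": "EntityEventSequence asc",
}
events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]
```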
In either case, the client-side logic is the same: read from the log, then request and process the corresponding records, as in the sketch below.
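Putting the two pieces together, a consumer loop might look something like this; the endpoint, the record-fetch URL shape, and the process() handler are illustrative assumptions rather than anything mandated by the specification:

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint


def process(record: dict) -> None:
    # Placeholder for the consumer's own upsert/indexing logic.
    ...


def sync_from(last_sequence: int) -> int:
    """Read events after last_sequence, then fetch and process each record."""
    params = {
        "$filter": f"EntityEventSequence gt {last_sequence}",
        "$orderby": "EntityEventSequence asc",
    }
    events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]

    for event in events:
        resource = event["ResourceName"]       # e.g., "Property"
        key = event["ResourceRecordKey"]
        record = requests.get(f"{BASE_URL}/{resource}('{key}')").json()
        process(record)
        last_sequence = event["EntityEventSequence"]

    return last_sequence  # persist this and pass it back in on the next run
```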
Implementation Considerations
For providers with large amounts of data and/or very active systems, the number of records in the log can grow very large.
There are a couple of things built into the RCP-027 specification that should help a bit, such as forward compaction of the log, which is discussed below.
Potential Implementations
There are a lot of ways log-based replication could be implemented on the backend.
At the time the specification was written, Kafka was popular; it provides a durable log that can grow very large and also supports compaction. Many cloud-based solutions enforce an expiration on event streams.
The goal in this section is to explore other options.
Storing Logs in Databases
Perhaps the simplest way to store a log is in a DBMS or NoSQL database that supports numerical identity columns. These meet the requirement that sequence numbers be monotonically increasing, and the log could be compacted according to the method outlined above so there is only ever at most one (latest) entry for each record in the system.
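As a minimal sketch of that approach (using SQLite purely for illustration), the autoincrementing primary key serves as EntityEventSequence, and compaction keeps only the latest event per record:

```python
import sqlite3

conn = sqlite3.connect("entity_event.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS entity_event (
        EntityEventSequence INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing
        ResourceName        TEXT NOT NULL,
        ResourceRecordKey   TEXT NOT NULL
    )
""")


def append_event(resource_name: str, record_key: str) -> int:
    """Append one event; the identity column assigns the next sequence number."""
    cur = conn.execute(
        "INSERT INTO entity_event (ResourceName, ResourceRecordKey) VALUES (?, ?)",
        (resource_name, record_key),
    )
    conn.commit()
    return cur.lastrowid


def compact() -> None:
    """Keep only the latest (highest-sequence) event for each record."""
    conn.execute("""
        DELETE FROM entity_event
        WHERE EntityEventSequence NOT IN (
            SELECT MAX(EntityEventSequence)
            FROM entity_event
            GROUP BY ResourceName, ResourceRecordKey
        )
    """)
    conn.commit()
```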
Serverless
Serverless backends have become more popular since the RCP-027 specification was written, and RESO even uses them itself for RESO Analytics.
In AWS, there are a couple of options; Kinesis, discussed below, is one example.
Other cloud providers, such as Azure, have support for events and streams as well.
Sliding Windows and Event Longevity
What if a provider uses a product or service that only supports sliding windows, such as Kinesis? Or perhaps they want to archive older events for other reasons.
One option is to periodically copy the log to cloud storage, such as S3, compacting it while doing so. If a client requests an event prior to the current window, it could still be retrieved using the requests shown above, even if a bit "slower," and batched according to client and server capabilities. Events should also be fairly compressible, since repeating resource names make up the majority of their payload.
On the consumer side, this maintains a consistent interface; on the producer side, it minimizes cost by offloading storage of the log.
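A rough sketch of that archiving step, assuming boto3 and a hypothetical bucket name; the window is compacted to the latest event per record before upload, and gzip benefits from the repeating resource names:

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-entityevent-archive"  # hypothetical bucket name


def archive_window(events: list[dict]) -> None:
    """Compact a window of events to the latest per record and copy it to S3."""
    latest: dict[tuple, dict] = {}
    for event in sorted(events, key=lambda e: e["EntityEventSequence"]):
        latest[(event["ResourceName"], event["ResourceRecordKey"])] = event

    compacted = sorted(latest.values(), key=lambda e: e["EntityEventSequence"])
    first = compacted[0]["EntityEventSequence"]
    last = compacted[-1]["EntityEventSequence"]

    s3.put_object(
        Bucket=BUCKET,
        Key=f"entity-event/{first}-{last}.json.gz",  # keyed by sequence range for later lookup
        Body=gzip.compress(json.dumps(compacted).encode("utf-8")),
    )
```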
One Log vs. Many
It's up to providers to decide how they might want to architect their logs. They can either use one log for all events in the system and determine visibility on access, or create separate logs for separate data feeds if they want more isolation. Either way, having a log with increasing sequence numbers allows them to order the log correctly.
Are All Logs Durable?
What does "durable" mean in this context?
As used here, it means that a consumer has access to all events needed to seed their feed, rather than those events becoming unavailable at a later time (ephemeral). There are lots of ways this can be implemented on the backend, and it partly depends on how providers partition their data. Events can also be forward-compacted, but there will always be at least one "pulse" per relevant record in the feed.
While the log for an overall system or feed (like IDX or BBO) might be durable and support replay, depending on how the provider designs their system, can some logs be ephemeral? Are there examples of business cases where one might store events with an expiration?
With respect to the listing transaction, including open houses and showings, it seems everything would be durable (whether it's visible to a given user is another matter). Older media might be something that could be removed, depending on storage requirements.
How to Implement EntityEventSequence Numbers
One way to implement sequence numbers is to generate them in a database on a numeric identity column. Another way is to use a transactional global state store and increment sequence numbers with each transaction. Many third party software and cloud products generate them automatically.
Something to consider is that modern computers can execute operations faster than millisecond precision can distinguish, especially when parallelized, so using something like a Unix timestamp as a sequence number has caveats.
If a system can generate more than 1,000 "pulses" per second, millisecond precision is no longer adequate for timestamps as sequence numbers; nanoseconds might be necessary. Even with adequate precision, the system still needs to correctly sequence the operations that fall within the same millisecond or "tick" of the clock, especially when performing bulk updates that require data consumers to re-pull many records in the log.
Others have come up with ways to combine timestamps with sequence numbers so that if the number of operations exceeds the precision of the given timestamp, the sequence numbers can be used to differentiate them while preserving the order of events in the system. See Snowflake IDs, which also include system identifiers and sequence numbers, for more information. These satisfy the RCP-027 EntityEventSequence requirements.
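As an illustrative sketch (not the exact Snowflake bit layout), a generator can pack a millisecond timestamp, a node identifier, and a per-millisecond sequence into one number so that IDs stay monotonically increasing even when many events land on the same tick:

```python
import threading
import time


class SequenceGenerator:
    """Snowflake-style IDs: millisecond timestamp | node id | per-millisecond sequence."""

    def __init__(self, node_id: int):
        self.node_id = node_id & 0x3FF   # 10 bits for the generating node
        self.sequence = 0                # 12 bits, reset each millisecond
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = time.time_ns() // 1_000_000
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # More than 4096 events this millisecond: wait for the next tick.
                    while now_ms <= self.last_ms:
                        now_ms = time.time_ns() // 1_000_000
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return (now_ms << 22) | (self.node_id << 12) | self.sequence
```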
Seeding the Log
Providers not currently using a log-based approach for their events will need to seed the log with the current system state for each record, including related records. During this bootstrapping phase, the provider would commit to some deterministic order of events (and would need to resolve internal timestamp collisions), and only one event per record is needed. This only needs to happen once, and the provider would only append (with possible compaction) after the initial creation of the log.
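A minimal sketch of that bootstrapping step, assuming records can be ordered deterministically by modification timestamp with the resource name and record key as tie-breakers (the ModificationTimestamp field and the input shape are assumptions here):

```python
def seed_log(records: list[dict]) -> list[dict]:
    """Emit exactly one seed event per current record, in a deterministic order."""
    ordered = sorted(
        records,
        key=lambda r: (r["ModificationTimestamp"], r["ResourceName"], r["ResourceRecordKey"]),
    )
    return [
        {
            "EntityEventSequence": sequence,
            "ResourceName": record["ResourceName"],
            "ResourceRecordKey": record["ResourceRecordKey"],
        }
        for sequence, record in enumerate(ordered, start=1)
    ]  # after this, the log is append-only (with possible compaction)
```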
Optional Relationship to HistoryTransactional
For those who support the HistoryTransactional Resource, there's an optional relationship between it and the EntityEvent Resource that may be supported. This might be useful, for example, for analytics consumers that are interested in the changes over time for a given set of records.
HistoryTransactional stores one record for each field in a given resource with old and new values. While each individual record in HistoryTransactional has a key and timestamp, there is no notion of a transaction id or batch/sequence id for a given set of changes outside of the resource name, record id, and timestamp. The EntityEventSequence could be used to link the change in HistoryTransactional to the EntityEvent Resource in these cases, if needed.
However, no explicit relationship is needed. Both EntityEvent and HistoryTransactional have the ResourceName and ResourceRecordKey fields, and those who have access to history can query it by those fields for each record in EntityEvent.
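Since both resources carry ResourceName and ResourceRecordKey, a consumer with history access could look up the field-level changes behind an event with a query like the following sketch (same hypothetical endpoint as above):

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint


def history_for_event(event: dict) -> list[dict]:
    """Fetch HistoryTransactional rows for the record referenced by an EntityEvent."""
    params = {
        "$filter": (
            f"ResourceName eq '{event['ResourceName']}' and "
            f"ResourceRecordKey eq '{event['ResourceRecordKey']}'"
        )
    }
    return requests.get(f"{BASE_URL}/HistoryTransactional", params=params).json()["value"]
```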
Additional Thoughts
How a provider implements their log is up to them, but the goal is to provide a simple and consistent interface for a data consumer to initialize their systems in a controlled manner from a large feed and remain synchronized.
References