The original RCP-027 does not state that eventId>0 queries must give you the entire set of events for an object since it began. It is completely silent on this. If we are proposing that all server vendors MUST support this, this should be a different RCP that suggests this behaviour as many vendors will not be able to support this easily. Yes, there are many ways to do this, some easier than others, but I feel forcing this as a requirement is going to burden some server vendors with much additional work that they don't have on their roadmaps.
Summary
One of the topics at the RESO 2023 Summer Developers Workshop was how to initialize systems using the EntityEvent Resource (log).
There were some questions around sliding windows and what it might mean if the log were to become large.
Initialization
Initialization was originally discussed in the Replication Subgroup (Transport), which is where the RCP-027 and RCP-028 proposals came from.
The thinking at the time was that data consumers needed a simple and reliable way to initialize new systems or reinitialize existing ones from a given feed, in addition to keeping them in sync. Reading events from the log at their own pace and picking up the corresponding records gives them a way to do so in a predictable manner.
Asking for events with sequence numbers greater than or equal to 0 always works for initialization, and asking for those greater than the last item the consumer has caught up with always works for syncing.
Requesting a given sequence number in the log should always return the next item in the sequence the consumer has access to if the item they requested is either not present or not visible to them. For example, if they ask for events greater than 41 and 42 isn't visible to them but 43 is, then 43 would be returned.
Initialization
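A minimal sketch of what an initialization request might look like, assuming an OData-style EntityEvent endpoint at a hypothetical base URL; the EntityEventSequence field comes from RCP-027, while the URL itself is illustrative:

```python
import requests  # third-party HTTP client

# Hypothetical RESO Web API base URL; EntityEvent and EntityEventSequence
# are the resource and field names from RCP-027.
BASE_URL = "https://api.example.com/reso/odata"

# Initialization: ask for every event with a sequence number >= 0, ordered
# so the log can be read from the beginning at the consumer's own pace.
params = {
    "$filter": "EntityEventSequence ge 0",
    "$orderby": "EntityEventSequence asc",
}
events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]
```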
Syncing
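The syncing case only changes the filter, picking up from the last sequence number the consumer processed (again a sketch, using the same hypothetical endpoint):

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint
last_sequence = 41  # the last event this consumer caught up with

# Syncing: ask only for events after the last sequence number processed.
# Per the behavior described above, if event 42 isn't visible to this
# consumer but 43 is, the response would begin at 43.
params = {
    "$filter": f"EntityEventSequence gt {last_sequence}",
    "$orderby": "EntityEventSequence asc",
}
events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]
```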
In either case, the client-side logic is the same: read from the log, then request and process the corresponding records, as in the sketch below.
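Putting the two pieces together, a consumer loop might look something like this; the endpoint, the record-fetch URL shape, and the process() handler are illustrative assumptions rather than anything mandated by the specification:

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint


def process(record: dict) -> None:
    # Placeholder for the consumer's own upsert/indexing logic.
    ...


def sync_from(last_sequence: int) -> int:
    """Read events after last_sequence, then fetch and process each record."""
    params = {
        "$filter": f"EntityEventSequence gt {last_sequence}",
        "$orderby": "EntityEventSequence asc",
    }
    events = requests.get(f"{BASE_URL}/EntityEvent", params=params).json()["value"]

    for event in events:
        resource = event["ResourceName"]       # e.g., "Property"
        key = event["ResourceRecordKey"]
        record = requests.get(f"{BASE_URL}/{resource}('{key}')").json()
        process(record)
        last_sequence = event["EntityEventSequence"]

    return last_sequence  # persist this and pass it back in on the next run
```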
Implementation Considerations
For providers with large amounts of data and/or very active systems, the number of records in the log can grow very large.
There are a couple of things built into the RCP-027 specification that should help a bit, such as forward compaction of the log, which is discussed below.
Potential Implementations
There are a lot of ways log-based replication could be implemented on the backend.
At the time the specification was written, Kafka was popular; it provides a durable log that can grow very large and also supports compaction. Many cloud-based solutions enforce an expiration on event streams.
The goal in this section is to explore other options.
Storing Logs in Databases
Perhaps the simplest way to store a log is in a DBMS or NoSQL database that supports numerical identity columns. These meet the requirement that sequence numbers be monotonically increasing, and the log could be compacted according to the method outlined above so there is only ever at most one (latest) entry for each record in the system.
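As a minimal sketch of that approach (using SQLite purely for illustration), the autoincrementing primary key serves as EntityEventSequence, and compaction keeps only the latest event per record:

```python
import sqlite3

conn = sqlite3.connect("entity_event.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS entity_event (
        EntityEventSequence INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonically increasing
        ResourceName        TEXT NOT NULL,
        ResourceRecordKey   TEXT NOT NULL
    )
""")


def append_event(resource_name: str, record_key: str) -> int:
    """Append one event; the identity column assigns the next sequence number."""
    cur = conn.execute(
        "INSERT INTO entity_event (ResourceName, ResourceRecordKey) VALUES (?, ?)",
        (resource_name, record_key),
    )
    conn.commit()
    return cur.lastrowid


def compact() -> None:
    """Keep only the latest (highest-sequence) event for each record."""
    conn.execute("""
        DELETE FROM entity_event
        WHERE EntityEventSequence NOT IN (
            SELECT MAX(EntityEventSequence)
            FROM entity_event
            GROUP BY ResourceName, ResourceRecordKey
        )
    """)
    conn.commit()
```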
Serverless
Serverless backends have become more popular since the RCP-027 specification was written, and RESO even uses them itself for RESO Analytics.
In AWS, there are a couple of options; Kinesis, discussed below, is one example.
Other cloud providers, such as Azure, have support for events and streams as well.
Sliding Windows and Event Longevity
What if a provider uses a product or service that only supports sliding windows, such as Kinesis? Or perhaps they want to archive older events for other reasons.
One option is to periodically copy the log to cloud storage, such as S3, compacting it while doing so. If a client requests an event prior to the current window, it could still be retrieved using the requests shown above, even if a bit "slower," and batched according to client and server capabilities. Events should also be fairly compressible, since repeating resource names make up the majority of their payload.
On the consumer side, this maintains a consistent interface; on the producer side, it minimizes cost by offloading storage of the log.
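A rough sketch of that archiving step, assuming boto3 and a hypothetical bucket name; the window is compacted to the latest event per record before upload, and gzip benefits from the repeating resource names:

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "example-entityevent-archive"  # hypothetical bucket name


def archive_window(events: list[dict]) -> None:
    """Compact a window of events to the latest per record and copy it to S3."""
    latest: dict[tuple, dict] = {}
    for event in sorted(events, key=lambda e: e["EntityEventSequence"]):
        latest[(event["ResourceName"], event["ResourceRecordKey"])] = event

    compacted = sorted(latest.values(), key=lambda e: e["EntityEventSequence"])
    first = compacted[0]["EntityEventSequence"]
    last = compacted[-1]["EntityEventSequence"]

    s3.put_object(
        Bucket=BUCKET,
        Key=f"entity-event/{first}-{last}.json.gz",  # keyed by sequence range for later lookup
        Body=gzip.compress(json.dumps(compacted).encode("utf-8")),
    )
```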
One Log vs. Many
It's up to providers to decide how they might want to architect their logs. They can either use one log for all events in the system and determine visibility on access, or create separate logs for separate data feeds if they want more isolation. Either way, having a log with increasing sequence numbers allows them to order the log correctly.
Are All Logs Durable?
What does "durable" mean in this context?
As used here, it means that a consumer has access to all events needed to seed their feed, rather than those events becoming unavailable at a later time (ephemeral). There are lots of ways this can be implemented on the backend, and it partly depends on how providers partition their data. Events can also be forward-compacted, but there will always be at least one "pulse" per relevant record in the feed.
While the log for an overall system or feed (like IDX or BBO) might be durable and support replay, depending on how the provider designs their system, can some logs be ephemeral? Are there examples of business cases where one might store events with an expiration?
With respect to the listing transaction, including open houses and showings, it seems everything would be durable (whether it's visible to a given user is another matter). Older media might be something that could be removed, depending on storage requirements.
How to Implement EntityEventSequence Numbers
One way to implement sequence numbers is to generate them in a database on a numeric identity column. Another way is to use a transactional global state store and increment sequence numbers with each transaction. Many third party software and cloud products generate them automatically.
Something to consider is that modern computers can execute operations faster than millisecond precision can distinguish, especially when parallelized, so using something like a Unix timestamp as a sequence number has caveats.
If a system can generate more than 1,000 "pulses" per second, millisecond precision is no longer adequate for timestamps as sequence numbers; nanoseconds might be necessary. Even with adequate precision, the system still needs to correctly sequence the operations that fall within the same millisecond or "tick" of the clock, especially when performing bulk updates that require data consumers to re-pull many records in the log.
Others have come up with ways to combine timestamps with sequence numbers so that if the number of operations exceeds the precision of the given timestamp, the sequence numbers can be used to differentiate them while preserving the order of events in the system. See Snowflake IDs, which also include system identifiers and sequence numbers, for more information. These satisfy the RCP-027 EntityEventSequence requirements.
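As an illustrative sketch (not the exact Snowflake bit layout), a generator can pack a millisecond timestamp, a node identifier, and a per-millisecond sequence into one number so that IDs stay monotonically increasing even when many events land on the same tick:

```python
import threading
import time


class SequenceGenerator:
    """Snowflake-style IDs: millisecond timestamp | node id | per-millisecond sequence."""

    def __init__(self, node_id: int):
        self.node_id = node_id & 0x3FF   # 10 bits for the generating node
        self.sequence = 0                # 12 bits, reset each millisecond
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = time.time_ns() // 1_000_000
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # More than 4096 events this millisecond: wait for the next tick.
                    while now_ms <= self.last_ms:
                        now_ms = time.time_ns() // 1_000_000
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return (now_ms << 22) | (self.node_id << 12) | self.sequence
```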
Seeding the Log
Providers not currently using a log-based approach for their events will need to seed the log with the current system state for each record, including related records. During this bootstrapping phase, the provider would commit to some deterministic order of events (and would need to resolve internal timestamp collisions), and only one event per record is needed. This only needs to happen once, and the provider would only append (with possible compaction) after the initial creation of the log.
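A minimal sketch of that bootstrapping step, assuming records can be ordered deterministically by modification timestamp with the resource name and record key as tie-breakers (the ModificationTimestamp field and the input shape are assumptions here):

```python
def seed_log(records: list[dict]) -> list[dict]:
    """Emit exactly one seed event per current record, in a deterministic order."""
    ordered = sorted(
        records,
        key=lambda r: (r["ModificationTimestamp"], r["ResourceName"], r["ResourceRecordKey"]),
    )
    return [
        {
            "EntityEventSequence": sequence,
            "ResourceName": record["ResourceName"],
            "ResourceRecordKey": record["ResourceRecordKey"],
        }
        for sequence, record in enumerate(ordered, start=1)
    ]  # after this, the log is append-only (with possible compaction)
```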
Optional Relationship to HistoryTransactional
For those who support the HistoryTransactional Resource, there's an optional relationship between it and the EntityEvent Resource that may be supported. This might be useful, for example, for analytics consumers that are interested in the changes over time for a given set of records.
HistoryTransactional stores one record for each field in a given resource with old and new values. While each individual record in HistoryTransactional has a key and timestamp, there is no notion of a transaction id or batch/sequence id for a given set of changes outside of the resource name, record id, and timestamp. The EntityEventSequence could be used to link the change in HistoryTransactional to the EntityEvent Resource in these cases, if needed.
However, no explicit relationship is needed. Both EntityEvent and HistoryTransactional have the ResourceName and ResourceRecordKey fields, and those who have access to history can query it by those fields for each record in EntityEvent.
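Since both resources carry ResourceName and ResourceRecordKey, a consumer with history access could look up the field-level changes behind an event with a query like the following sketch (same hypothetical endpoint as above):

```python
import requests

BASE_URL = "https://api.example.com/reso/odata"  # hypothetical endpoint


def history_for_event(event: dict) -> list[dict]:
    """Fetch HistoryTransactional rows for the record referenced by an EntityEvent."""
    params = {
        "$filter": (
            f"ResourceName eq '{event['ResourceName']}' and "
            f"ResourceRecordKey eq '{event['ResourceRecordKey']}'"
        )
    }
    return requests.get(f"{BASE_URL}/HistoryTransactional", params=params).json()["value"]
```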
Additional Thoughts
How a provider implements their log is up to them, but the goal is to provide a simple and consistent interface for a data consumer to initialize their systems in a controlled manner from a large feed and remain synchronized.
References