Psi Store Format
Here we describe the byte-level format of \psi stores so that they may be potentially consumed outside of \psi and outside of .NET. The same format is used for remoting with an additional handshake protocol documented here.
Reference implementations are available for Python and for F# (though, being a .NET language, \psi itself may be used from F#).
As described in the Brief Introduction and expounded on in the section covering Datasets, a \psi store is an extremely efficient and easy way to persist \psi streams to disk. For example:
var store = PsiStore.Create(pipeline, "MyStore", "~/Some/Path");
myStream.Write("MyStream", store);
myOtherStream.Write("OtherStream", store);
Stores may be explored and visualized in PsiStudio or may be opened for playback within a pipeline. For example:
var store = PsiStore.Open(pipeline, "MyStore", "~/Some/Path");
var myStream = store.OpenStream<double>("MyStream");
var myOtherStream = store.OpenStream<MyType>("OtherStream");
Many streams may be written to a single store and streams may be of any .NET type, including user-provided types. When reading back streams, the types must be known and given as the type parameter (T) to the generic OpenStream<T> API. This also implies that the types are available to the consuming application (i.e., the correct assembly references have been made).
Having the types is not strictly necessary in order to recover the data. The store contains type-schema information defining the shape of the data, decomposing to a small list of primitives. To read a stream for which the .NET types are unknown or unavailable, the OpenDynamicStream(...) API may be used to open a stream of dynamic objects having members with names and leaf-node types matching the original type (e.g. MyType):
var myOtherStream = store.OpenDynamicStream("OtherStream");
This same type-schema metadata may also be used to drive parsing and reconstruction of message data from persisted \psi stores within ecosystems outside of \psi and outside of .NET (e.g. from Python), which we will demonstrate as we go.
A \psi store is a collection of files on disk, containing serialized data from all of the streams that have been written to it, as well as a catalog of metadata information and indexes to facilitate random access:
MyStore.Catalog_000000.psi
MyStore.Data_000000.psi
MyStore.Data_000001.psi
MyStore.Index_000000.psi
MyStore.LargeData_000000.psi
MyStore.LargeData_000002.psi
MyStore.LargeData_000003.psi
MyStore.Live
A \psi store comprises three main types of files:
- Data / LargeData containing serialized message data.
- Catalog containing metadata describing the available streams; statistics and type information.
- Index containing index entries to facilitate random access to message data.
Each is broken into numbered extents and given names in the form <StoreName>.<Type>_<extent>.psi. Each type of file represents a contiguous set, but is broken into extents to limit individual file sizes. Data, for example, is broken into 256MB files, numbered *.Data_000000.psi, *.Data_000001.psi, ...
Additionally, *.Live
is a marker file indicating that the store is actively being written to. For example, it may be quite useful to view a store in PsiStudio while the writing application is running.
Data within each file is broken into "blocks" of opaque packets of bytes. Blocks are written sequentially until each "extent" is filled and a new extent is created. These chains of extents are sometimes referred to as "infinite files" because, in a live application, they represent data from or about streams that may accumulate indefinitely. The only limit is disk space.
All the types of *.psi
files have the same foundational structure. Starting with the first (*.*_000000.psi
) file, opaque blocks of data within may be read. Each is a length-prefixed block of bytes:
Length | Block ... | Length | Block ... | ... |
---|---|---|---|---|
Int32 | n-Bytes | Int32 | n-Bytes |
The length is a 4-byte little-endian signed integer. Given a positive length, this is followed by n-bytes comprising the block. Rinse and repeat.
A length of zero signals the end of the chain of blocks. However, if the *.Live marker file is present, this could be temporary while waiting for more data to be written.
A negative length indicates the end of an extent. The absolute value of the length is then the extent number with which to continue. For example, a -1 length in extent *_000000 means to continue with extent *_000001.
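The block-walking rules above can be sketched in Python. This is a minimal sketch rather than the reference reader: `iter_blocks` is a hypothetical helper, the extents are assumed to be already loaded into a dict keyed by extent number, and reading is assumed to resume at byte 0 of the next extent.

```python
import struct
from typing import Dict, Iterator


def iter_blocks(extents: Dict[int, bytes], start_extent: int = 0) -> Iterator[bytes]:
    """Yield each block payload from a chain of extents held in memory."""
    extent, pos = start_extent, 0
    while extent in extents:
        data = extents[extent]
        if pos + 4 > len(data):
            return  # ran off the end without seeing a terminator
        (length,) = struct.unpack_from("<i", data, pos)  # 4-byte LE signed
        pos += 4
        if length == 0:
            return  # end of chain (possibly temporary if *.Live exists)
        if length < 0:
            # Assumption: continue at the start of extent number |length|.
            extent, pos = -length, 0
            continue
        yield data[pos:pos + length]
        pos += length
```

A live reader would instead poll for more data when it sees a zero length while the *.Live marker is present.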
Booleans are encoded as single-byte values where 1
signifies True
. Floating point values are 32-/64-bit IEEE. Fixnum integer values are little-endian. DateTimes and strings are discussed further in the Parsing Primitives section below.
Next we'll talk about the meaning of the payload of the opaque blocks of bytes within each type of file.
Tools such as PsiStudio may want to quickly seek to a particular time within the data store. To facilitate this, the Index
stores information about where to look for messages corresponding to a particular time.
Each entry represents a byte-position within an extent, along with the largest creation- and originating-time seen up to that point.
Extent | Position | CreationTime | OriginatingTime |
---|---|---|---|
Int32 | Int32 | DateTime | DateTime |
Entries are not written for every message recorded, but instead are triggered by a threshold of data (i.e., 4KB) having been written. This potentially sparse indexing is enough to seek to the vicinity of a desired time stamp and begin reading.
The Extent and Position fields are 4-byte little-endian signed integers. As we'll discuss below, message data may be stored in the Data extents or in the LargeData extents. A non-negative Extent number in an index entry indicates a (non-large) Data file. For example an Extent of 1 means to look in *.Data_000001.psi. A negative Extent number is encoded such that adding 2³¹ (2,147,483,648) to it yields the LargeData extent number. For example, an Extent of -2147483647 means to look in *.LargeData_000001.psi.
The Position is the byte-position within that extent corresponding to the Length field of a block. For example, a Position of 12345 means to seek to byte offset 12345 and begin reading blocks (4-byte length, followed by n-bytes as usual).
The CreationTime and OriginatingTime fields represent the largest time value seen up to the Position in the given Extent. There may be messages between indexed points. Each time value is encoded in 8 bytes, the lower 62 bits of which represent the 100-nanosecond ticks since 1/1/0001 12:00AM in UTC (see the Parsing Primitives/DateTime section below).
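As a hedged illustration of the index entry layout, the following Python sketch unpacks one 24-byte entry and resolves the Extent field to either a Data or a LargeData extent number. The function name and the returned dictionary keys are assumptions made for this example.

```python
import struct

TICKS_MASK = 0x3FFFFFFFFFFFFFFF  # lower 62 bits of an 8-byte DateTime
INDEX_ENTRY = struct.Struct("<iiQQ")  # Extent, Position, CreationTime, OriginatingTime


def parse_index_entry(buf: bytes, offset: int = 0) -> dict:
    """Decode one index entry (hypothetical helper, not part of \\psi)."""
    extent, position, creation, originating = INDEX_ENTRY.unpack_from(buf, offset)
    large = extent < 0
    if large:
        extent += 2 ** 31  # negative values map onto LargeData extent numbers
    return {
        "large": large,
        "extent": extent,
        "position": position,
        "creation_ticks": creation & TICKS_MASK,
        "originating_ticks": originating & TICKS_MASK,
    }
```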
Catalog files contain metadata information about the runtime, the streams contained in the store and schema information covering the message types. Metadata blocks begin with:
Name | Id | TypeName | Version | SerializerTypeName | SerializerVersion | CustomFlags | Kind |
---|---|---|---|---|---|---|---|
String | Int32 | String | Int32 | String | Int32 | UInt16 | UInt16 |
The meaning of the fields depends on the Kind of metadata, as described below. String fields are length-prefixed, UTF-8 bytes as described in the Parsing Primitives section below. The Int32 fields are 4-byte little-endian signed integers while the UInt16 fields are 2-byte unsigned little-endian.
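The common metadata header might be read as follows. This is a sketch under the layout given in the table above. `read_string` and `read_metadata_header` are hypothetical helper names, and the Kind-specific fields that follow the header are left unparsed.

```python
import struct


def read_string(buf: bytes, offset: int):
    """Length-prefixed UTF-8 string; a length of -1 means null."""
    (length,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    if length < 0:
        return None, offset
    return buf[offset:offset + length].decode("utf-8"), offset + length


def read_metadata_header(buf: bytes, offset: int = 0):
    """Return (header dict, offset of the first Kind-specific field)."""
    name, offset = read_string(buf, offset)
    (ident,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    type_name, offset = read_string(buf, offset)
    (version,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    serializer_type_name, offset = read_string(buf, offset)
    serializer_version, custom_flags, kind = struct.unpack_from("<iHH", buf, offset)
    offset += 8  # Int32 + UInt16 + UInt16
    header = {
        "name": name, "id": ident, "type_name": type_name, "version": version,
        "serializer_type_name": serializer_type_name,
        "serializer_version": serializer_version,
        "custom_flags": custom_flags, "kind": kind,
    }
    return header, offset
```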
The Kind field indicates the kind of metadata in the record, of which there are three:
- StreamMetadata (Kind = 0) describes a stream contained in the store.
- RuntimeInfo (Kind = 1) describes the \psi runtime and hosting application that persisted the store.
- TypeSchema (Kind = 2) describes a datatype used in a message stream.
RuntimeInfo fields have the following contents. A single such record is written at the start of the catalog.
- Name - The full name of the Microsoft.Psi assembly used to persist the store (e.g. Microsoft.Psi, Version=..., Culture=neutral, PublicKeyToken=...).
- Id - Unused (0).
- TypeName - Full name of the Microsoft.Psi assembly.
- Version - The Microsoft.Psi assembly version encoded as 16-bit major followed by 16-bit minor.
- SerializerTypeName - Unused (null).
- SerializerVersion - The runtime version (e.g. 2, not the assembly version, but a small number incremented as breaking changes are made).
- CustomFlags - Unused (0).
StreamMetadata main fields have the following contents. A record for each stream is written once on pipeline start and again with updated information when streams are closed.
- Name - Stream name (given to the Write(<name>, store) API).
- Id - An ID assigned by the \psi runtime.
- TypeName - The full assembly-qualified .NET type name of the type of messages on the stream.
- Version - The \psi metadata version (e.g. 2, incremented as breaking changes are made).
- SerializerTypeName - Unused (null).
- SerializerVersion - Unused (0).
- CustomFlags - Indicates whether the stream is persisted, closed, indexed, and/or polymorphic.
StreamMetadata
contains additional fields describing the wall-clock time extents of the stream, the number of messages written to the stream, and statistics around the message size and latency:
OpenedTime | ClosedTime | MessageCount | MessageSizeCumulativeSum | LatencyCumulativeSum |
---|---|---|---|---|
DateTime | DateTime | Int64 | Int64 | Int64 |
The parsing of 8-byte DateTime
values has been described briefly above and in detail in the Parsing Primitives section below. The Int64
values are 8-byte little-endian signed integers.
- OpenedTime - the wall-clock time that the stream was opened.
- ClosedTime - the wall-clock time that the stream was closed (updated upon close).
- MessageCount - the number of messages on the stream (updated upon close).
- MessageSizeCumulativeSum - the total bytes of all messages on the stream (updated upon close).
- LatencyCumulativeSum - the total 100-nanosecond ticks of accumulated latency (difference between creation- and originating-times, updated upon close).
Some of these fields (and some below) may be updated in subsequent records for a given stream. For example, ClosedTime, MessageCount, etc. may initially contain dummy (e.g. 0) values. Once a stream has been closed the final record can be expected to be accurate. The CustomFlags field (described below) may be used to determine whether a record represents a closed stream.
Following these fields are additional fields indicating the first and last message creation- and originating-times:
FirstMessageCreationTime | LastMessageCreationTime | FirstMessageOriginatingTime | LastMessageOriginatingTime |
---|---|---|---|
DateTime | DateTime | DateTime | DateTime |
The CustomFlags field contains bits indicating several non-mutually exclusive attributes of a stream:
- 0x01 - Not persisted to the store.
- 0x02 - The stream has closed.
- 0x04 - The stream is indexed, meaning that index entries appear in Data while message payload appears in LargeData, as discussed below in the Data Files section.
- 0x08 - The stream contains a polymorphic message type, as discussed next.
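Because the flags are independent bits, a reader would typically test each mask separately. The following short Python sketch shows one way to do so. The label strings and the function name are illustrative assumptions, while the bit values are those listed above.

```python
from typing import List

# Bit values from the list above; the labels are only for illustration.
STREAM_FLAGS = {
    0x01: "not persisted",
    0x02: "closed",
    0x04: "indexed",
    0x08: "polymorphic",
}


def describe_stream_flags(custom_flags: int) -> List[str]:
    """Return the label of every flag bit set in a StreamMetadata CustomFlags value."""
    return [label for bit, label in STREAM_FLAGS.items() if custom_flags & bit]
```

For example, a CustomFlags value of 0x06 describes a stream that is both closed and indexed.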
Polymorphic streams may be of various concrete types. In this case, the TypeName
indicates the polymorphic type name but, for the purpose of parsing such streams, the possible concrete types must be known. Thus, StreamMetadata
marked with CustomFlags
indicating a polymorphic message type (mask 0x08
) are followed by a list of the concrete types (and updated StreamMetadata
records are emitted as new concrete types are encountered on the stream):
TypeCount | TypeId | TypeName | TypeId | TypeName | ... |
---|---|---|---|---|---|
Int32 | Int32 | String | Int32 | String | ... |
The TypeCount
indicates the number of concrete types (TypeId
/TypeName
pairs) to follow. Each TypeName
is a full assembly-qualified .NET type name while the TypeId
is an index value assigned by the \psi runtime.
Following this are several fields describing supplemental stream metadata, which is a value of a user-specified type. This is represented as opaque serialized bytes along with the type name. Deserialization is identical to message payload deserialization as described below in the Data Files section.
SupplementalTypeName | SupplementalLength | SupplementalBytes |
---|---|---|
String | Int32 | n-Bytes |
The SupplementalTypeName
is a full assembly-qualified .NET type name. The SupplementalLength
indicates the number of bytes comprising the SupplementalBytes
.
It should be noted that the above format applies to the current version (version 2
, 23 APR 2021) of the StreamMetadata
format. The Version
field indicates the version actually being read. In prior versions, the MessageCount
field was an Int32
rather than an Int64
and the MessageSizeCumulativeSum
and LatencyCumulativeSum
were not present. Instead, following the LastMessageOriginatingTime
field, an AverageMessageSize
(Int32
) and AverageMessageLatency
(Int32
, in microseconds) field gave similar statistics. Finally, the idea of supplemental stream metadata, and so the Supplemental*
fields, were added in version 1 (16 JUN 2020).
In order to parse message data (and supplemental stream metadata), the schema of the data must be known. The TypeSchema
catalog records serve this purpose. The main fields have the following contents:
- Name - The contract name within a composite type.
- Id - A unique ID assigned by the \psi runtime.
- TypeName - The full assembly-qualified .NET type name.
- Version - The version of the serializer that generated the schema (used for versioning of custom serializers, as described in the Data File section).
- SerializerTypeName - The full assembly-qualified .NET type name of the serializer that generated the schema.
- SerializerVersion - The version of the serializer that generated the schema (used for versioning of custom serializers, as described in the Data File section)
-
CustomFlags - Unused (
0
).
Following this are fields describing members of the type:
MemberCount | Name | Type | IsRequired | Name | Type | IsRequired | ... | TypeFlags |
---|---|---|---|---|---|---|---|---|
Int32 | String | String | Boolean | String | String | Boolean | ... | UInt32 |
The MemberCount indicates the number of members (Name/Type/IsRequired triples) to follow. Each Name is the simple member name, while the Type is a full assembly-qualified .NET type name. The IsRequired flag is a single-byte boolean (0 = false, 1 = true) indicating whether the member is required (to support optional members for backwards compatibility).
The final TypeFlags field is a 4-byte little-endian unsigned integer indicating the category of the type:
- 0x01 - The type is a class (reference type).
- 0x02 - The type is a struct (value type).
- 0x04 - The type is a contract (interface type).
- 0x08 - The type is a collection (enumerable type).
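The member list and trailing TypeFlags could be read along the following lines. This is a sketch for the version 2 layout described above. The helper names are hypothetical, and the offset is assumed to point just past the common metadata header.

```python
import struct


def read_string(buf: bytes, offset: int):
    """Length-prefixed UTF-8 string; a length of -1 means null."""
    (length,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    if length < 0:
        return None, offset
    return buf[offset:offset + length].decode("utf-8"), offset + length


def read_schema_members(buf: bytes, offset: int):
    """Return (members, type_flags, new_offset) for a TypeSchema record body."""
    (count,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    members = []
    for _ in range(count):
        name, offset = read_string(buf, offset)
        type_name, offset = read_string(buf, offset)
        is_required = buf[offset] != 0  # single-byte boolean
        offset += 1
        members.append((name, type_name, is_required))
    (type_flags,) = struct.unpack_from("<I", buf, offset)
    return members, type_flags, offset + 4
```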
It should be noted that the TypeFlags
field was added in version 2 (30 NOV 2018) of the TypeSchema
format. Prior to this, a determination would need to be made from the Type
name alone, which implies reflection of the underlying .NET type, or a mapping of such information. For this reason, a non-.NET \psi store reader that is backward compatible with older stores may be difficult to robustly implement for unexpected types.
Some schema names do not represent the assembly-qualified .NET type name. Instead, the name has been taken from a data contract attribute. However, the type names within schemas (and stream type names) are always assembly-qualified type names. To overcome this, we must provide a mapping of known data contract names to their equivalent type names.
The data files represent streams of contiguous messages. The Data
and LargeData
files contain nearly identically formatted message data, interleaved among all of the streams written to the store. Each block begins with a message envelope:
SourceId | SequenceId | OriginatingTime | CreationTime |
---|---|---|---|
Int32 | Int32 | DateTime | DateTime |
The SourceId
is a stream ID assigned by the \psi runtime. It corresponds to the Id
field of a StreamMetadata
entry from the Catalog
. The SequenceId
is a monotonically increasing message number within the stream. The OriginatingTime
is the time at which the message originated in the world (at the source sensor, etc.) while the CreationTime
is the pipeline time at which the message was last propagated.
Warning: The OriginatingTime
and CreationTime
fields are (for no particular reason) ordered differently to the same fields within index entries.
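A minimal Python sketch of envelope parsing follows, assuming a block has already been read as described earlier. The function name and dictionary keys are illustrative. Note the field order, with OriginatingTime before CreationTime.

```python
import struct

TICKS_MASK = 0x3FFFFFFFFFFFFFFF  # lower 62 bits of an 8-byte DateTime
# SourceId, SequenceId, OriginatingTime, CreationTime
ENVELOPE = struct.Struct("<iiQQ")


def parse_envelope(block: bytes):
    """Split a data block into its envelope and the remaining payload bytes."""
    source_id, sequence_id, originating, creation = ENVELOPE.unpack_from(block, 0)
    envelope = {
        "source_id": source_id,
        "sequence_id": sequence_id,
        "originating_ticks": originating & TICKS_MASK,
        "creation_ticks": creation & TICKS_MASK,
    }
    return envelope, block[ENVELOPE.size:]
```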
Remember that stream metadata may contain CustomFlags marking a stream as being "indexed" (mask 0x04). Such streams are treated specially by writing the message blocks to the LargeData file and writing an "index entry" pointing to it into the Data file. This means that in order to know how to interpret a message block within the Data file, it must be known whether the stream is indexed. For this, we must refer to StreamMetadata entries previously read from the Catalog. The SourceId of the message will correspond to the Id field of a StreamMetadata entry, the CustomFlags of which will tell us whether the stream is indexed.
For an indexed stream in the Data file, the envelope will be followed by an "index entry" identical to those found in the Index files:
Extent | Position | CreationTime | OriginatingTime |
---|---|---|---|
Int32 | Int32 | DateTime | DateTime |
The Extent
and Position
will point to the LargeData
file set for the actual message data. The LargeData
file never contains index entries and may be read independently as a mere sequence of interleaved "large" messages or may be indexed into. It should be noted that the interleaving of messages roughly as they occurred in the pipeline is preserved in the "small" Data
file set, while the LargeData
files are only interleaved with other indexed "large" streams.
For streams that are not indexed (or for streams in the LargeData files), further parsing is driven by type-schema information previously gathered from the Catalog. Beginning with the TypeSchema corresponding to the SourceId, we parse the message bytes. A reference implementation can be found in the DynamicMessageDeserializer.
The base case is when the TypeName
is a known primitive and can be parsed directly. Names are assembly-qualified, but it is safe to consider up to the first ','
character (e.g. pseudocode typeName.split(',')[0]
). The following are the simple primitive types:
- "System.Single" (32-bit IEEE floating point)
- "System.Double" (64-bit IEEE floating point)
- "System.SByte" (signed byte)
- "System.Byte" (unsigned byte)
- "System.Int16" (signed 2-byte little-endian integer)
- "System.UInt16" (unsigned 2-byte little-endian integer)
- "System.Int32" (signed 4-byte little-endian integer)
- "System.UInt32" (unsigned 4-byte little-endian integer)
- "System.Int64" (signed 8-byte little-endian integer)
- "System.UInt64" (unsigned 8-byte little-endian integer)
- "System.Boolean" (1-byte, 0 = false, 1 = true)
- "System.Char" (2-byte UTF-16 character)
- "System.DateTime" (8-byte DateTime)
- "System.String" (length-prefixed UTF-8 bytes)
Parsing of some of these types is described in more detail below in the Parsing Primitives section.
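The fixed-size primitives in this list map naturally onto Python's `struct` format codes. The table and helper below are a sketch with assumed names. `System.String` is deliberately left out because it is length-prefixed and handled separately, and `System.DateTime` is returned as the raw 8-byte value.

```python
import struct

# struct format codes for the fixed-size primitives listed above.
PRIMITIVE_FORMATS = {
    "System.Single": "<f", "System.Double": "<d",
    "System.SByte": "<b", "System.Byte": "<B",
    "System.Int16": "<h", "System.UInt16": "<H",
    "System.Int32": "<i", "System.UInt32": "<I",
    "System.Int64": "<q", "System.UInt64": "<Q",
    "System.Boolean": "<?",
    "System.Char": "<H",      # one UTF-16 code unit
    "System.DateTime": "<Q",  # raw value; mask the low 62 bits for ticks
}


def read_primitive(type_name: str, buf: bytes, offset: int):
    """Read one primitive; returns (value, new_offset)."""
    simple = type_name.split(",")[0].strip()  # drop the assembly qualification
    fmt = PRIMITIVE_FORMATS[simple]
    (value,) = struct.unpack_from(fmt, buf, offset)
    if simple == "System.Char":
        value = chr(value)
    return value, offset + struct.calcsize(fmt)
```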
Strings (System.String
) are somewhat special. As a reference type, strings may be preceded by a prefix allowing for reusable instances. This prefix is present only when the string is not a member of a collection (see below).
Prefix |
---|
UInt32 |
The low 30 bits may contain an ID, while the high 2 bits indicate the kind of prefix:
High Bits | Low Bits (mask 0x3FFFFFFF) | Meaning |
---|---|---|
b00 (0x0) | ... | Null |
b01 (0x4) | Instance Ordinal | Existing Instance |
b10 (0x8) | ... | New Instance |
b11 (0xC) | TypeSchema ID | Typed |
A Prefix
mask of 0x0
means that the string is null
(string parsing already supports null
by way of a -1
length, but this is an additional mechanism used for reference types in general). In this case, the Prefix
is not followed by a string.
A Prefix
mask of 0x80000000
means that this is a new string and is followed by a string to parse:
Prefix | Length | UTF-8 Encoded Bytes |
---|---|---|
0x8 | Int32 | n-Bytes |
This new instance should be tucked away in a per-stream "cache" keyed by ordinal. This is because a Prefix
mask of 0x4
means that a string is an existing instance from the cache (to intern strings for efficiency). A value mask of 0x3FFFFFFF
yields the zero-based index into the cache. For example, a Prefix
mask of 0x40000007
means to return the 8th cached instance.
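The prefix handling for strings can be sketched as below. This assumes that cache ordinals are assigned in order of first appearance, starting at zero, and that the cache is a plain list owned by the caller. `read_ref_string` is a hypothetical name.

```python
import struct

KIND_NULL, KIND_EXISTING, KIND_NEW, KIND_TYPED = 0b00, 0b01, 0b10, 0b11


def read_ref_string(buf: bytes, offset: int, cache: list):
    """Read a prefixed string; returns (value, new_offset)."""
    (prefix,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    kind, low = prefix >> 30, prefix & 0x3FFFFFFF
    if kind == KIND_NULL:
        return None, offset
    if kind == KIND_EXISTING:
        return cache[low], offset  # low bits are the zero-based ordinal
    if kind == KIND_NEW:
        (length,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        value = None if length < 0 else buf[offset:offset + length].decode("utf-8")
        offset += max(length, 0)
        cache.append(value)  # assumption: ordinals follow insertion order
        return value, offset
    raise ValueError("typed prefix is not expected for strings")
```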
All other types are compositions of the primitive types. Based on the SourceId
, type-schema information previously read from the Catalog
should be retrieved. Remember that the TypeFlags
field of the TypeSchema
will determine whether we are reading a struct, a class, a contract, or a collection.
All composite types except structs (i.e., classes, contracts, and collections) are treated as reference types and have a Prefix
field similar to strings.
Prefix |
---|
UInt32 |
Just as with strings, a Prefix mask of 0x0 means that the value is null and the Prefix is not followed by a value to parse.
Again, the New and Existing flags avoid the inefficiency of encoding the same value many times and (more importantly) make it possible to encode cyclic structures. It is up to the deserializer to maintain a cache on a per-message basis of ref instances deserialized thus far. If a given value is contained more than once, the subsequent instances will be merely encoded as Existing, with the lower 30-bits giving an ID (cached ordinal) used to recover the shared instance.
Unlike strings, other reference types may be polymorphic or concrete implementations of a contract (interface). For example, a stream may be of type T but messages on the stream may be of implementing types or subtypes U or V. A Prefix mask of 0xC indicates a concrete subtype. A value mask of 0x3FFFFFFF yields the Id of the TypeSchema with which to drive parsing.
A Prefix
mask of 0x8
again means that this is a new instance and is followed by a value to parse (the precise type of which depends on the type-schema -- described later):
Prefix | Schema-driven Value |
---|---|
0x8 | ... |
Once parsed, this new instance too should be tucked away in the per-stream "cache" keyed by ordinal. This is both for efficiency and to allow serialization of object graphs containing cycles.
A Prefix
mask of 0x4
again means that an existing instance from the cache should be returned and the Prefix
is not followed by a value to parse. A value mask of 0x3FFFFFFF
yields the zero-based index into the cache. For example, a Prefix
mask of 0x40000007
means to return the 8th cached instance.
Composite types other than collections (i.e., structs, classes, and contracts) comprise a set of named members.
Reference types (i.e. classes and contracts, but not structs) should be added to the per-stream instance cache. This should be done (with dummy instances) before parsing members in order to support circular references.
To populate an instance, walk the TypeSchema
member Name
/Type
information in the order that it was originally read from the Catalog
and parse and assign each member value (again, consider this document to be recursive!).
Collections (indicated by TypeFlags
field of TypeSchema
) are length-prefixed sets of elements. These are composite types as well and so are preceded by a Prefix
and the semantics of that explained in the previous section.
Length | n-Elements ... |
---|---|
UInt32 | ... |
Collections are homogeneous and the type can be found on the "Elements" member of the TypeSchema, which happens to always be the first member (e.g. pseudocode schema.Member[0].Type).
Parsing each element, driven by the TypeSchema
, produces the collection (consider this document to be recursive!).
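A collection reader reduces to a count followed by repeated element parsing. The sketch below assumes the Prefix has already been consumed and takes the element reader as a callback. Both function names are illustrative.

```python
import struct
from typing import Callable, List, Tuple


def read_int32(buf: bytes, offset: int) -> Tuple[int, int]:
    """Example element reader for System.Int32 elements."""
    (value,) = struct.unpack_from("<i", buf, offset)
    return value, offset + 4


def read_collection(buf: bytes, offset: int, read_element: Callable) -> Tuple[List, int]:
    """Read a UInt32 element count, then that many elements."""
    (count,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    items = []
    for _ in range(count):
        item, offset = read_element(buf, offset)  # recursion point for nested types
        items.append(item)
    return items, offset
```

In a full reader, `read_element` would be chosen from the type of the schema's first member rather than hard-coded.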
It should be noted that strings are treated specially. Normally, string values are preceded by a Prefix
field as discussed above. But when strings are members of collections they are parsed as bare values (as if they followed a Prefix
indicating a new instance). They are still added to the per-stream instance cache of reference type instances.
Also, a set of deserializers must provide a mapping from type names to a corresponding deserialization function. It is pre-populated with the .NET primitives along with several custom serializers (which cannot be derived purely by type schema information). The single exception is the System.String deserializer, because it is a reference type with interactions with the internal instance cache. These deserializers are explained in detail in the Custom Serializers section below.
The DateTime
type is encoded in 8-bytes, representing .NET DateTime
values which may be parsed as follows:
Kind | Ticks |
---|---|
2-bits | 62-bits |
The Kind
flags are described here, but \psi always uses UTC (high bit set). The lower 62-bits (mask 0x3FFFFFFFFFFFFFFF
) represent 100-nanosecond ticks since 1/1/0001 12:00AM.
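Converting the tick count to a native timestamp might look like the following Python sketch. `ticks_to_datetime` is a hypothetical helper. Python's `datetime` has microsecond resolution, so the final 100-nanosecond digit is truncated, and the Kind bits are simply discarded on the assumption that the value is UTC.

```python
from datetime import datetime, timedelta, timezone

TICKS_MASK = 0x3FFFFFFFFFFFFFFF  # lower 62 bits hold the tick count
EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)  # 1/1/0001 12:00AM


def ticks_to_datetime(raw: int) -> datetime:
    """Convert a raw 8-byte DateTime value to a timezone-aware datetime."""
    ticks = raw & TICKS_MASK
    return EPOCH + timedelta(microseconds=ticks // 10)  # 10 ticks per microsecond
```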
Strings are a length-prefixed string of UTF-8 encoded bytes.
Length | UTF-8 Encoded Bytes |
---|---|
Int32 | n-Bytes |
When the Length field is -1 then the value is null, whereas when the Length is 0 the value is the empty string (""). For positive Length values, n-bytes follow and should be interpreted as a UTF-8 encoded character string.
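The three cases (null, empty, and non-empty) can be handled in a few lines of Python. This is a sketch with an assumed function name, returning the decoded value together with the offset of the next field.

```python
import struct
from typing import Optional, Tuple


def read_string(buf: bytes, offset: int = 0) -> Tuple[Optional[str], int]:
    """Read a length-prefixed UTF-8 string; returns (value, new_offset)."""
    (length,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    if length == -1:
        return None, offset  # null string, no bytes follow
    end = offset + length    # a length of 0 yields the empty string
    return buf[offset:end].decode("utf-8"), end
```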
Booleans are single-byte values where 1
indicates True
.
Reading a stream within a store is accomplished by reading data and filtering to the stream in which we're interested -- that is, filtering to envelopes with a sourceId matching the desired stream ID. Remember that stream data is interleaved in the store. A more convenient API may be to read a stream by name rather than by ID. For this, we need to find the stream metadata for the given name in the catalog and extract the ID.
However, this would be an inefficient way to read many streams from a store. For a single-pass implementation that reads multiple streams at once, filter to multiple streams and dispatch messages on a per-stream basis.
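A single-pass, multi-stream read can be sketched as a filter-and-dispatch loop over already-extracted blocks. The function name is hypothetical, and the payloads are returned as raw bytes after the 24-byte envelope, leaving schema-driven parsing to the caller.

```python
import struct
from typing import Dict, Iterable, List, Set, Tuple

ENVELOPE_SIZE = 24  # SourceId + SequenceId + two 8-byte DateTimes


def dispatch_messages(blocks: Iterable[bytes], wanted_ids: Set[int]) -> Dict[int, List[Tuple[int, bytes]]]:
    """Group (SequenceId, payload) pairs by SourceId in one pass over the blocks."""
    per_stream: Dict[int, List[Tuple[int, bytes]]] = {sid: [] for sid in wanted_ids}
    for block in blocks:
        source_id, sequence_id = struct.unpack_from("<ii", block, 0)
        if source_id in per_stream:  # skip interleaved messages from other streams
            per_stream[source_id].append((sequence_id, block[ENVELOPE_SIZE:]))
    return per_stream
```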
If the goal is to randomly access the data, exploring by timestamps (e.g., as PsiStudio does), then the Index is used to quickly seek within the data files. Likely the whole index can be maintained in memory to direct loading, seeking within and reading of individual Data/LargeData extents on demand. Individual entries may be converted to an extent ID, a position and an indication of whether data can be found in the regular or large extents.
Some types have hand-crafted serializers in \psi. Some of these produce type schema information that fully describes their format and so need no special treatment. Others need matching hand-crafted deserializers. The data reader handles reading the prefix, caching instances, etc. Custom deserializers need only to handle actual instance bytes.
A MemoryStream
is serialized as a simple length-prefixed array of bytes:
Length | Buffer ... |
---|---|
Int32 | n-Bytes |