implement design proposal arising from #54 #55

Merged
merged 11 commits into from
Oct 26, 2022
1 change: 1 addition & 0 deletions Project.toml
@@ -6,6 +6,7 @@ version = "0.5.0"
[deps]
Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
UUIDs = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[compat]
Arrow = "2"
4 changes: 2 additions & 2 deletions docs/src/arrow-concepts.md
@@ -12,8 +12,8 @@ Legolas defines a special field `legolas_schema_qualified` that Legolas-aware Ar

Arrow tables which include this field are considered to "support Legolas schema discovery" and are referred to as "Legolas-discoverable", since Legolas consumers may employ this field to automatically match the table against available application-layer Legolas schema definitions.

If present, the `legolas_schema_qualified` field's value must be a [fully qualified schema identifier](@ref schema_identifier_specification).
If present, the `legolas_schema_qualified` field's value must be a [fully qualified schema version identifier](@ref schema_version_identifier_specification).

## Arrow File Naming Conventions

When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension `*.<unqualified schema name>.arrow`. For example, if the file's table's Legolas schema is `baz.supercar@1>bar.automobile@1`, use the file extension `*.baz.supercar.arrow`.
When writing a Legolas-discoverable Arrow table to a file, prefer using the file extension `*.<schema name>.arrow`. For example, if the file's table's full Legolas schema version identifier is `baz.supercar@1>bar.automobile@1`, use the file extension `*.baz.supercar.arrow`.
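
As a rough illustration of the naming convention above, the following sketch derives a file name from a schema version identifier. The helper `arrow_filename` and the file stem `"garage"` are hypothetical and are not part of Legolas.jl; the sketch only assumes that the schema name is whatever precedes the first `@` in the identifier.

```jl
# Hypothetical helper (not part of Legolas.jl): build a file name that follows
# the `*.<schema name>.arrow` convention from a schema version identifier.
function arrow_filename(stem::AbstractString, identifier::AbstractString)
    # The child-most segment comes first, e.g. "baz.supercar@1".
    child = first(split(identifier, '>'))
    # Drop the "@version" suffix to recover the schema name.
    name = first(split(child, '@'))
    return string(stem, '.', name, ".arrow")
end

arrow_filename("garage", "baz.supercar@1>bar.automobile@1")  # "garage.baz.supercar.arrow"
```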
6 changes: 5 additions & 1 deletion docs/src/faq.md
@@ -8,6 +8,10 @@ The package originated from code developed internally at Beacon to wrangling het

## Why does Legolas.jl support Arrow as a (de)serialization target, but not, say, JSON?

Technically, Legolas.jl's core `row`/`Schema` functionality is totally agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.
Technically, Legolas.jl's core `@schema`/`@version` functionality is agnostic to (de)serialization and could be useful for anybody who wants to wrangle Tables.jl-compliant values.

Otherwise, with regards to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with [Arrow.jl](https://github.com/JuliaData/Arrow.jl) "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with [JSON3.jl](https://github.com/quinnj/JSON3.jl) or other packages.

## Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded

Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of [this pull request](https://github.com/beacon-biosignals/Legolas.jl/pull/54), and directly resulted from a [design mini-investigation](https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e) associated with those efforts.
26 changes: 14 additions & 12 deletions docs/src/index.md
@@ -11,24 +11,27 @@ CurrentModule = Legolas
## Legolas `Schema`s

```@docs
Legolas.Schema
Legolas.SchemaVersion
Legolas.@schema
Legolas.@version
Legolas.is_valid_schema_name
Legolas.parse_schema_identifier
Legolas.schema_name
Legolas.schema_version
Legolas.schema_identifier
Legolas.schema_parent
Legolas.schema_fields
Legolas.schema_declaration
Legolas.schema_declared
Legolas.row
Legolas.parse_identifier
Legolas.name
Legolas.version
Legolas.identifier
Legolas.parent
Legolas.required_fields
Legolas.declaration
Legolas.declared
Legolas.find_violation
Legolas.complies_with
Legolas.validate
```

## Validating/Writing/Reading Legolas Tables

```@docs
Legolas.extract_legolas_schema
Legolas.extract_schema_version
Legolas.write
Legolas.read
```
@@ -38,7 +41,6 @@ Legolas.read
```@docs
Legolas.lift
Legolas.construct
Legolas.guess_schema
Legolas.assign_to_table_metadata!
Legolas.gather
Legolas.locations
46 changes: 18 additions & 28 deletions docs/src/schema-concepts.md
@@ -4,59 +4,49 @@

If you're a newcomer to Legolas.jl, please familiarize yourself with the [tour](https://github.com/beacon-biosignals/Legolas.jl/blob/main/examples/tour.jl) before diving into this documentation.

## [Schema Identifiers](@id schema_identifier_specification)
## [Schema Version Identifiers](@id schema_version_identifier_specification)

Legolas defines "schema identifiers" as strings of the form:
Legolas defines "schema version identifiers" as strings of the form:

- `name@version` where:
- `name` is a lowercase alphanumeric string and may include the special characters `.` and `-`.
- `version` is a non-negative integer.
- or, `x>y` where `x` and `y` are valid schema identifiers and `>` denotes "extends from".
- or, `x>y` where `x` and `y` are valid schema version identifiers and `>` denotes "extends from".

A schema identifier is said to be *fully qualified* if it includes the identifiers of all known ancestors of the particular schema that it directly identifies.
A schema version identifier is said to be *fully qualified* if it includes the identifiers of all known ancestors of the particular schema version that it directly identifies.

Schema authors should follow the below conventions when choosing the `name` part of a new schema's identifier:
Schema authors should follow the below conventions when choosing the name of a new schema:

1. Include a namespace. For example, assuming the schema is defined in a package Foo.jl, `foo.automobile` is good, `automobile` is bad.
2. Prefer singular over plural. For example, `foo.automobile` is good, `foo.automobiles` is bad.
3. Don't "overqualify" the schema name with ancestor-derived information. For example, `bar.automobile@1>foo.automobile@1` is good, `baz.supercar@1>bar.automobile@1` is good, `bar.foo.automobile@1>foo.automobile@1` is bad, `baz.automobile.supercar@1>bar.automobile@1` is bad.
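
To make the identifier grammar above concrete, here is a minimal sketch that splits an identifier into its `name@version` segments using only Base Julia. The names `SEGMENT_PATTERN` and `split_identifier` are hypothetical; Legolas.jl ships its own parsing functions, whose exact behavior may differ from this sketch.

```jl
# Hypothetical sketch, not Legolas.jl's implementation.
# A segment is `name@version`: `name` is lowercase alphanumeric and may contain
# `.` and `-`; `version` is a non-negative integer.
const SEGMENT_PATTERN = r"^([a-z0-9.-]+)@(\d+)$"

function split_identifier(identifier::AbstractString)
    # `>` denotes "extends from", so segments are ordered child first.
    return map(split(identifier, '>')) do segment
        m = match(SEGMENT_PATTERN, segment)
        m === nothing && throw(ArgumentError("malformed segment: $segment"))
        (name = String(m.captures[1]), version = parse(Int, m.captures[2]))
    end
end

split_identifier("baz.supercar@1>bar.automobile@1")
```

Under these assumptions, an identifier such as `Foo.Automobile@1` is rejected because its name contains uppercase characters.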

## Schema Versioning: You Break It, You Bump It

While it is fairly established practice to [semantically version source code](https://semver.org/), the world of data/artifact versioning is a bit more varied. As presented in the tour, each `Legolas.Schema` has a single version integer. The central rule that governs Legolas' schema versioning approach is:
While it is fairly established practice to [semantically version source code](https://semver.org/), the world of data/artifact versioning is a bit more varied. As presented in the tour, each `Legolas.SchemaVersion` carries a single version integer. The central rule that governs Legolas' schema versioning approach is:

**If an update is made to a schema that potentially requires existing data to be rewritten in order to comply with the updated schema, then the version integer associated with that schema should be incremented.**
**Do not introduce a change to an existing schema version that might cause existing compliant data to become non-compliant; instead, incorporate the intended change in a new schema version whose version number is one greater than the previous version number.**

In other words: you break it, you bump it!
For example, a schema author must introduce a new schema version for any of the following changes:

For example, a schema author must increment their existing schema's version integer if any of the following changes are made:

- A new non-`>:Missing` required field is added to the schema.
- A new type-restricted required field is added to the schema.
- An existing required field's type restriction is tightened.
- An existing required field is renamed.

One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents `@schema("my-schema@1", ...)` and `@schema("my-schema@2", ...)` from being defined and utilized simultaneously. The source code that defines any given Legolas schema and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.
One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents `@version("my-schema@1", ...)` and `@version("my-schema@2", ...)` from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.

## Important Expectations Regarding Custom Field Assignments
Note that it is preferable to avoid introducing new versions of an existing schema, if possible, in order to minimize code/data churn for downstream producers/consumers. Thus, authors should prefer conservative field type restrictions from the get-go. Remember: loosening a field type restriction is not a breaking change, but tightening one is.
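
The difference between loosening and tightening can be sketched with plain Julia type checks. Everything below is a hypothetical model rather than Legolas.jl's validation logic: `complies` is an illustrative helper, required fields are modeled as a `NamedTuple` of types, and an absent field is treated as `missing`.

```jl
using UUIDs

# Hypothetical compliance check: every required field must be an instance of
# its declared type, with absent fields treated as `missing`.
function complies(row::NamedTuple, required::NamedTuple)
    return all(pairs(required)) do (name, T)
        get(row, name, missing) isa T
    end
end

# Data written against the original field restrictions.
old_row = (id = "6ba7b810-9dad-11d1-80b4-00c04fd430c8",)

original       = (id = Union{UUID,String},)
loosened       = (id = Union{UUID,String,Missing},)
tightened      = (id = UUID,)
added_required = (id = Union{UUID,String}, label = String)
added_optional = (id = Union{UUID,String}, label = Union{Missing,String})

complies(old_row, original)        # true
complies(old_row, loosened)        # true: loosening keeps old data compliant
complies(old_row, tightened)       # false: tightening calls for a new version
complies(old_row, added_required)  # false: new non-`>:Missing` required field
complies(old_row, added_optional)  # true: the new field accepts `missing`
```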

Schema authors should ensure that their schema declarations meet two important expectations so that Legolas' `row` function behaves as intended and inter-schema composability is preserved.
## Important Expectations Regarding Custom Field Assignments

First, a schema's custom field assignments should preserve the [idempotency](https://en.wikipedia.org/wiki/Idempotence) of `row` invocations, such that the following holds for all valid values of `fields`:
Schema authors should ensure that their `@version` declarations meet two important expectations so that generated record types behave as intended:

```jl
row(schema, row(schema, fields)) == row(schema, fields)
```
1. Custom field assignments should preserve the [idempotency](https://en.wikipedia.org/wiki/Idempotence) of record type constructors.
2. Custom field assignments should not observe mutable non-local state.

Second, a schema's custom field assignments should not observe mutable non-local state, such that the following holds for all valid values of `fields`:
Thus, given a Legolas-generated record type `R`, the following should hold for all valid values of `fields`:

```jl
row(schema, fields) == row(schema, fields)
R(R(fields)) == R(fields)
R(fields) == R(fields)
```
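
The two expectations can be illustrated without Legolas.jl by treating a custom field assignment as an ordinary function from a field's input value to its stored value. The functions `normalize_id`, `bump` and `next_id` below are hypothetical stand-ins for the right-hand side of a field assignment, not generated record constructors.

```jl
using UUIDs

# Idempotent: applying the assignment to its own output changes nothing.
normalize_id(id) = id isa UUID ? id : UUID(id)

# Not idempotent: every application changes the value again.
bump(n) = n + 1

# Observes mutable non-local state: the result depends on a global counter.
const COUNTER = Ref(0)
next_id(_) = (COUNTER[] += 1)

s = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
normalize_id(normalize_id(s)) == normalize_id(s)  # true
bump(bump(1)) == bump(1)                          # false
next_id(nothing) == next_id(nothing)              # false
```

Assignments shaped like `normalize_id` satisfy both properties, while assignments shaped like `bump` or `next_id` would cause reconstructing a record from its own fields to yield a different record.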

## How to Avoid Breaking Schema Changes

It is preferable to avoid incrementing a schema's version integer ("making a breaking change") whenever possible to avoid code/data churn for consumers. Following the below guidelines should help make breaking changes less likely:

1. Allow required fields to be `Missing` whenever reasonable.
2. Prefer conservative field type restrictions from the get-go, to avoid needing to tighten them later.
3. Handle/enforce "potential deprecation paths" in a required field's RHS definition when possible. For example, imagine a schema that contains a required field `id::Union{UUID,String} = id` where `id` is either a `UUID`, or a `String` that may be parsed as a `UUID`. Now, let's imagine we decided we wanted to update the schema such that new tables ALWAYS normalize `id` to a proper `UUID`. In this case, it is preferable to simply update this required field to `id::Union{UUID,String} = UUID(id)` instead of `id::UUID = id`. The latter is a breaking change that requires incrementing the schema's version integer, while the former achieves the same practical result without breaking consumers of old data.