refactor package to remove sharp edges for schema authors/users (particularly @row / Row) and improve API #54

Merged 29 commits on Oct 27, 2022
Commits
917fbb9
Add name and type util functions for schemas
OTDE Aug 9, 2022
4ca4b42
Add tests for field name/type util functions
OTDE Aug 9, 2022
b42e64a
Invoke generated struct directly
OTDE Aug 9, 2022
def1138
Add dispatch on generated struct type
OTDE Aug 9, 2022
c4e0bb7
Undo accidental formatting
OTDE Aug 9, 2022
304d017
Use consistent kwarg spacing
OTDE Aug 9, 2022
f8a3a13
Apply suggestions from code review
OTDE Aug 9, 2022
cbd1fcf
Remove fallback method errors for util functions
OTDE Aug 29, 2022
176aa6f
Merge branch 'sc/column-name-utility-functions' of https://github.com…
OTDE Aug 29, 2022
1158b56
Update tests to expect new error type
OTDE Aug 29, 2022
c9d7de5
refactor
jrevels Oct 3, 2022
32b8663
fix typo
jrevels Oct 3, 2022
44efde2
julia version compat fix
jrevels Oct 3, 2022
3633d12
another julia compat fix
jrevels Oct 3, 2022
c2ea420
make comment consistent with code change
jrevels Oct 3, 2022
5c5c044
fix field type escaping and add a test
jrevels Oct 3, 2022
6ee2d8d
fix julia 1.3 compat issue again
jrevels Oct 3, 2022
2906296
moar julia 1.3 compat fixes
jrevels Oct 3, 2022
6b968c7
implement design proposal arising from #54 (#55)
jrevels Oct 26, 2022
f8b48fa
bump minimum Julia version to latest LTS minor version
jrevels Oct 26, 2022
c6c0d57
Update src/schemas.jl
jrevels Oct 26, 2022
53be860
Update src/schemas.jl
jrevels Oct 26, 2022
b9eca8a
Update docs/src/schema-concepts.md
jrevels Oct 26, 2022
e3dd4c0
tweak err msg, fix tests
jrevels Oct 26, 2022
5f8ec57
add additional syntax note
jrevels Oct 26, 2022
2fe38fc
make sure to test nested (de)serialization of record types
jrevels Oct 26, 2022
9fe8092
use record type names to drive version declaration, rather than ident…
jrevels Oct 27, 2022
7315fd3
add a couple more nested serialization tests
jrevels Oct 27, 2022
e8b0cb4
Update docs/src/schema-concepts.md
jrevels Oct 27, 2022
2 changes: 1 addition & 1 deletion docs/src/faq.md
@@ -12,6 +12,6 @@ Technically, Legolas.jl's core `@schema`/`@version` functionality is agnostic to

Otherwise, with regard to (de)serialization-specific functionality, Beacon has put effort into ensuring Legolas.jl works well with [Arrow.jl](https://github.com/JuliaData/Arrow.jl) "by default" simply because we're heavy users of the Arrow format. There's nothing stopping users from composing the package with [JSON3.jl](https://github.com/quinnj/JSON3.jl) or other packages.

## Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded
## Why are Legolas.jl's generated record types defined the way that they are? For example, why is the version number hardcoded in the type name?

Many of Legolas' current choices on this front stem from refactoring efforts undertaken as part of [this pull request](https://github.com/beacon-biosignals/Legolas.jl/pull/54), and directly resulted from a [design mini-investigation](https://gist.github.com/jrevels/fdfe939109bee23566d425440b7c759e) associated with those efforts.
6 changes: 3 additions & 3 deletions docs/src/schema-concepts.md
@@ -13,13 +13,13 @@ Legolas defines "schema version identifiers" as strings of the form:
- `version` is a non-negative integer.
- or, `x>y` where `x` and `y` are valid schema version identifiers and `>` denotes "extends from".

A schema version identifier is said to be *fully qualified* if it includes the identifiers of all known ancestors of the particular schema version that it directly identifies.
A schema version identifier is said to be *fully qualified* if it includes the identifiers of all ancestors of the particular schema version that it directly identifies.

Schema authors should follow the below conventions when choosing the name of a new schema:

1. Include a namespace. For example, assuming the schema is defined in a package Foo.jl, `foo.automobile` is good, `automobile` is bad.
2. Prefer singular over plural. For example, `foo.automobile` is good, `foo.automobiles` is bad.
3. Don't "overqualify" the schema name with ancestor-derived information. For example, `bar.automobile@1>foo.automobile@1` is good, `baz.supercar@1>bar.automobile@1` is good, `bar.foo.automobile@1>foo.automobile@1` is bad, `baz.automobile.supercar@1>bar.automobile@1` is bad.
3. Don't "overqualify" a schema name with ancestor-derived information that is better captured by the fully qualified identifier of a specific schema version. For example, `bar.automobile` should be preferred over `bar.foo.automobile`, since `bar.automobile@1>foo.automobile@1` is preferable to `bar.foo.automobile@1>foo.automobile@1`. Similarly, `baz.supercar` should be preferred over `baz.automobile.supercar`, since `baz.supercar@1>bar.automobile@1` is preferable to `baz.automobile.supercar@1>bar.automobile@1`.
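
As a rough illustration of how a fully qualified identifier decomposes, the sketch below splits one on `>` and parses each `name@version` segment. `parse_identifier` is a hypothetical helper written for this example, not part of Legolas.jl, and the character set it accepts for names (lowercase alphanumerics, `.` and `-`) is an assumption.

```julia
# Hypothetical helper for illustration only; not part of Legolas.jl's API.
# Splits a schema version identifier on `>` ("extends from") and parses each
# `name@version` segment, returning the segments in child-to-ancestor order.
function parse_identifier(id::AbstractString)
    return map(split(id, '>')) do segment
        # Assumed name alphabet: lowercase alphanumerics plus `.` and `-`.
        m = match(r"^([a-z0-9.\-]+)@(\d+)$", strip(segment))
        m === nothing && throw(ArgumentError("malformed segment: $segment"))
        return (name=String(m.captures[1]), version=parse(Int, m.captures[2]))
    end
end

lineage = parse_identifier("baz.supercar@1>bar.automobile@1")
```

Because the ancestry is already carried by the segments after `>`, the child's own name (`baz.supercar`) does not need to repeat it, which is the point of convention 3.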

## Schema Versioning: You Break It, You Bump It

@@ -33,7 +33,7 @@ For example, a schema author must introduce a new schema version for any of the
- An existing required field's type restriction is tightened.
- An existing required field is renamed.

One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents `@version("my-schema@1", ...)` and `@version("my-schema@2", ...)` from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.
One benefit of Legolas' approach is that multiple schema versions may be defined in the same codebase, e.g. there's nothing that prevents `@version(FooV1, ...)` and `@version(FooV2, ...)` from being defined and utilized simultaneously. The source code that defines any given Legolas schema version and/or consumes/produces Legolas tables is presumably already semantically versioned, such that consumer/producer packages can determine their compatibility with each other in the usual manner via interpreting major/minor/patch increments.

Note that it is preferable to avoid introducing new versions of an existing schema, if possible, in order to minimize code/data churn for downstream producers/consumers. Thus, authors should prefer conservative field type restrictions from the get-go. Remember: loosening a field type restriction is not a breaking change, but tightening one is.
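
The loosening/tightening distinction can be made concrete with Julia's subtyping operator. In this sketch, `is_loosening` is a hypothetical helper (not part of Legolas.jl) that treats a change to a field's type restriction as non-breaking exactly when every type admitted by the old restriction is still admitted by the new one:

```julia
# Hypothetical helper for illustration only; not part of Legolas.jl.
# A change from `old_type` to `new_type` is a loosening when `old_type <: new_type`.
is_loosening(old_type::Type, new_type::Type) = old_type <: new_type

loosened = is_loosening(Int, Real)                      # Int -> Real admits more values
tightened = is_loosening(Real, Int)                     # Real -> Int rejects e.g. Float64
now_optional = is_loosening(String, Union{String,Missing})
```

Under this reading, only the `Real -> Int` change would call for a new schema version.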

111 changes: 65 additions & 46 deletions examples/tour.jl
@@ -18,25 +18,41 @@ using Legolas: @schema, @version, complies_with, find_violation, validate
# let's start the tour by declaring a new Legolas schema via the `@schema` macro.

# Here, we declare a new schema named `example.foo`, specifying that Legolas should
# use the prefix `Foo` whenever it generates `example.foo`-related type definitions:
# use the prefix `Foo` for all `example.foo`-related type definitions:
@schema "example.foo" Foo

# The above schema declaration provides the necessary scaffolding to start declaring
# new *versions* of the `example.foo` schema. Schema version declarations specify the
# set of required fields that a given table (or row) must contain in order to comply
# with that schema version. Let's use the `@version` macro to declare an initial
# version of the `example.foo` schema with some required fields:
@version "example.foo@1" begin
@version FooV1 begin
a::Real
b::String
c
d::AbstractVector
end

# Behind the scenes, this `@version` declaration automatically generated some type definitions
# and overloaded a bunch of useful Legolas methods with respect to `example.foo@1`. One of the
# types it generated is `FooSchemaV1`, an alias for `Legolas.SchemaVersion`:
@test FooSchemaV1() == Legolas.SchemaVersion("example.foo", 1)
# In the above declaration, the symbol `FooV1` can be broken into the prefix `Foo` (as
# specified in `example.foo`'s `@schema` declaration) and `1`, the integer that identifies
# this particular version of the `example.foo` schema. The `@version` macro requires this
# symbol to always follow this format (`$(prefix)V$(integer)`), because it generates two
# special types that match it. For example, our `@version` declaration above generated:
#
# - `FooV1`: A special subtype of `Tables.AbstractRow` whose fields match the corresponding
# schema version's declared required fields.
# - `FooV1SchemaVersion`: An alias for `Legolas.SchemaVersion` that matches the corresponding
# schema version.

# Let's first examine `FooV1SchemaVersion`:
@test Legolas.SchemaVersion("example.foo", 1) == FooV1SchemaVersion()
@test Legolas.SchemaVersion("example.foo", 1) isa FooV1SchemaVersion
@test "example.foo@1" == Legolas.identifier(FooV1SchemaVersion())

# As you can see, Legolas' Julia-agnostic identifier for this schema version is `example.foo@1`.
# To avoid confusion throughout this tour, we'll use this Julia-agnostic identifier to refer to
# individual schema versions in the abstract sense, while we'll use the relevant `SchemaVersion`
# aliases to specifically refer to the types that represent schema versions in Julia.
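
# The `$(prefix)V$(integer)` naming format described above can be sketched with a small parser.
# `split_record_type_name` is a hypothetical helper written for this illustration; it is not part
# of Legolas.jl and only mirrors the format, not the checks `@version` actually performs.

```julia
# Hypothetical helper for illustration only; not part of Legolas.jl.
# Splits a symbol such as `:FooV1` into its prefix and integer version.
function split_record_type_name(name::Symbol)
    m = match(r"^(.+)V(\d+)$", String(name))
    m === nothing && throw(ArgumentError("expected `\$(prefix)V\$(integer)`, got: $name"))
    return (prefix=Symbol(m.captures[1]), version=parse(Int, m.captures[2]))
end

foo = split_record_type_name(:FooV1)
child = split_record_type_name(:ChildParamV1)
```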

#####
##### `Tables.Schema` Compliance/Validation
@@ -53,28 +69,28 @@ for s in [Tables.Schema((:a, :b, :c, :d), (Real, String, Any, AbstractVector)),
Tables.Schema((:a, :b, :d), (Int, String, Vector)), # Fields whose declared type constraints are `>:Missing` may be elided entirely.
Tables.Schema((:a, :x, :b, :y, :d), (Int, Any, String, Any, Vector))] # Non-required fields may also be present.
# if `complies_with` finds a violation, it returns `false`; returns `true` otherwise
@test complies_with(s, FooSchemaV1())
@test complies_with(s, FooV1SchemaVersion())

# if `validate` finds a violation, it throws an error indicating the violation;
# returns `nothing` otherwise
@test validate(s, FooSchemaV1()) isa Nothing
@test validate(s, FooV1SchemaVersion()) isa Nothing

# if `find_violation` finds a violation, it returns a tuple indicating the relevant
# field and its violation; returns `nothing` otherwise
@test isnothing(find_violation(s, FooSchemaV1()))
@test isnothing(find_violation(s, FooV1SchemaVersion()))
end

# ...while the below `Tables.Schema`s do not:

s = Tables.Schema((:a, :c, :d), (Int, Float64, Vector)) # The required non-`>:Missing` field `b::String` is not present.
@test !complies_with(s, FooSchemaV1())
@test_throws ArgumentError validate(s, FooSchemaV1())
@test isequal(find_violation(s, FooSchemaV1()), :b => missing)
@test !complies_with(s, FooV1SchemaVersion())
@test_throws ArgumentError validate(s, FooV1SchemaVersion())
@test isequal(find_violation(s, FooV1SchemaVersion()), :b => missing)

s = Tables.Schema((:a, :b, :c, :d), (Int, String, Float64, Any)) # The type of required field `d::AbstractVector` is not `<:AbstractVector`.
@test !complies_with(s, FooSchemaV1())
@test_throws ArgumentError validate(s, FooSchemaV1())
@test isequal(find_violation(s, FooSchemaV1()), :d => Any)
@test !complies_with(s, FooV1SchemaVersion())
@test_throws ArgumentError validate(s, FooV1SchemaVersion())
@test isequal(find_violation(s, FooV1SchemaVersion()), :d => Any)
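
# The compliance rule exercised above can be approximated in a few lines. This is a sketch under
# stated assumptions, not the implementation of `Legolas.find_violation`: schemas are modeled as
# plain NamedTuples mapping field names to types, and `sketch_find_violation` is a hypothetical name.

```julia
# Hypothetical stand-in for illustration only; NOT `Legolas.find_violation`.
# Returns `name => violation` for the first violated required field, else `nothing`.
function sketch_find_violation(found::NamedTuple, required::NamedTuple)
    for name in keys(required)
        R = required[name]
        if haskey(found, name)
            # Present fields must satisfy the declared type constraint.
            found[name] <: R || return name => found[name]
        elseif !(Missing <: R)
            # Absent fields are implicitly `missing`, which `R` must admit.
            return name => missing
        end
    end
    return nothing
end

required = (a=Real, b=String, c=Any, d=AbstractVector)
ok = sketch_find_violation((a=Int, b=String, d=Vector), required)
no_b = sketch_find_violation((a=Int, c=Float64, d=Vector), required)
bad_d = sketch_find_violation((a=Int, b=String, c=Float64, d=Any), required)
```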

# The expectations that characterize Legolas' particular notion of "schematic compliance" - requiring the
# presence of pre-specified declared fields, assuming non-present fields to be implicitly `missing`, and allowing
@@ -87,12 +103,14 @@ s = Tables.Schema((:a, :b, :c, :d), (Int, String, Float64, Any)) # The type of r
#####
##### Legolas-Generated Record Types
#####
# In addition to `FooSchemaV1`, `example.foo@1`'s `@version` declaration also generated a new type,
# `FooV1 <: Tables.AbstractRow`, whose fields are guaranteed to match all the fields required by
# `example.foo@1`. We refer to such Legolas-generated types as "Legolas record types" (see
# https://en.wikipedia.org/wiki/Record_(computer_science)).

# Legolas record type constructors accept keyword arguments or `Tables.AbstractRow`-compliant values:
# As mentioned in this tour's introduction, `FooV1` is a subtype of `Tables.AbstractRow` whose fields are guaranteed to
# match all the fields required by `example.foo@1`. We refer to such Legolas-generated types as "record types" (see
# https://en.wikipedia.org/wiki/Record_(computer_science)). These record types are direct subtypes of
# `Legolas.AbstractRecord`, which is, itself, a subtype of `Tables.AbstractRow`:
@test FooV1 <: Legolas.AbstractRecord <: Tables.AbstractRow

# Record type constructors accept keyword arguments or `Tables.AbstractRow`-compliant values:
fields = (a=1.0, b="hi", c=π, d=[1, 2, 3])
@test NamedTuple(FooV1(; fields...)) == fields
@test NamedTuple(FooV1(fields)) == fields
@@ -129,7 +147,7 @@ foo = FooV1(; a=1.0, b="hi", d=[1, 2, 3])
# any such assignments, so let's declare a new schema version `example.bar@1` that does:
@schema "example.bar" Bar

@version "example.bar@1" begin
@version BarV1 begin
x::Union{Int8,Missing} = ismissing(x) ? x : Int8(clamp(x, -128, 127))
y::String = string(y)
z::String = ismissing(z) ? string(y, '_', x) : z
@@ -177,7 +195,7 @@ const GLOBAL_STATE = Ref(0)

@schema "example.bad" Bad

@version "example.bad@1" begin
@version BadV1 begin
x::Int = x + 1
y = (GLOBAL_STATE[] += y; GLOBAL_STATE[])
end
@@ -198,7 +216,7 @@ fields = (x=1, y=1)
# as an "extension" of `example.bar@1`:
@schema "example.baz" Baz

@version "example.baz@1 > example.bar@1" begin
@version BazV1 > BarV1 begin
x::Int8
z::String
k::Int64 = ismissing(k) ? length(z) : k
@@ -211,14 +229,14 @@ end
# For a given Legolas schema version extension to be valid, all `Tables.Schema`s that comply with the child
# must comply with the parent, but the reverse need not be true. We can check a schema version's required fields
# and their type constraints via `Legolas.required_fields`. Based on these outputs, it is a worthwhile exercise
# to confirm for yourself that `BazSchemaV1` is a valid extension of `BarSchemaV1` under the aforementioned rule:
@test Legolas.required_fields(BarSchemaV1()) == (x=Union{Missing,Int8}, y=String, z=String)
@test Legolas.required_fields(BazSchemaV1()) == (x=Int8, y=String, z=String, k=Int64)
# to confirm for yourself that `BazV1SchemaVersion` is a valid extension of `BarV1SchemaVersion` under the aforementioned rule:
@test Legolas.required_fields(BarV1SchemaVersion()) == (x=Union{Missing,Int8}, y=String, z=String)
@test Legolas.required_fields(BazV1SchemaVersion()) == (x=Int8, y=String, z=String, k=Int64)
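
# The exercise suggested above can be mechanized roughly as follows. `sketch_valid_extension` is a
# hypothetical helper, not a Legolas.jl function; it takes NamedTuples shaped like the output of
# `Legolas.required_fields`. The `broken` value assumes the child would inherit `y` and `z` unchanged.

```julia
# Hypothetical check for illustration only; Legolas performs its own validation
# when `@version` is evaluated. Every parent field must appear in the child with
# a type constraint that is a subtype of the parent's constraint.
function sketch_valid_extension(child::NamedTuple, parent::NamedTuple)
    return all(haskey(child, n) && child[n] <: parent[n] for n in keys(parent))
end

bar = (x=Union{Missing,Int8}, y=String, z=String)
baz = (x=Int8, y=String, z=String, k=Int64)
broken = (x=Any, y=String, z=String)  # assumed shape of an `x::Any` child

baz_is_valid = sketch_valid_extension(baz, bar)
broken_is_valid = sketch_valid_extension(broken, bar)
```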

# As a counterexample, the following is invalid, because the declaration of `x::Any` would allow for `x`
# values that are disallowed by the parent schema version `example.bar@1`:
@schema "example.broken" Broken
@test_throws Legolas.SchemaVersionDeclarationError @version "example.broken@1 > example.bar@1" begin x::Any end
@test_throws Legolas.SchemaVersionDeclarationError @version BrokenV1 > BarV1 begin x::Any end

# Record type constructors generated for extension schema versions will apply the parent's field
# assignments before applying the child's field assignments. Notice how `BazV1` applies the
@@ -237,28 +255,29 @@ end
# return new(x, y, z, k)
# end

# One last note on syntax: You might ask "Why use `>` as the inheritance operator instead of `<:`?" There are two reasons.
# The primary reason is purely historical: earlier versions of Legolas did not as rigorously demand/enforce subtyping
# relationships between parent and child schemas' required fields, and so the `<:` operator was considered to be a bit
# too misleading. A secondary reason in favor of `>` was that it implied the actual order of application of field
# constraints/transformations (i.e. the parent's are applied before the child's).
# One last note on syntax: You might ask "Why use the greater-than symbol as the inheritance operator instead of `<:`?"
# There are a few reasons. The primary reason is purely historical: earlier versions of Legolas did not as rigorously
# demand/enforce subtyping relationships between parent and child schemas' required fields, and so the `<:` operator
# was considered to be a bit too misleading. A secondary reason in favor of `>` was that it implied the actual order
# of application of constraints (i.e. the parent's are applied before the child's). Lastly, `>` aligns well with the
# property that child schema versions have a greater number of constraints than their parents.

#####
##### Schema Versioning
#####

# Throughout this tour, all `@version` declarations have used the version number `1`, and thus every generated
# record type and `SchemaVersion` alias has had the suffix `V1`. As you might guess, you can declare more than
# a single version of any given schema, and the generated types' suffix will always match the version integer:
# Throughout this tour, all `@version` declarations have used the version number `1`. As you might guess, you can
# declare more than a single version of any given schema. Here's an example using the `example.foo` schema we defined
# earlier:

@version "example.foo@2" begin
@version FooV2 begin
a::Float64
b::String
c::Int
d::Vector
end

@test FooSchemaV2() == Legolas.SchemaVersion("example.foo", 2)
@test FooV2SchemaVersion() == Legolas.SchemaVersion("example.foo", 2)

fields = (a=1.0, b="b", c=3, d=[1,2,3])
@test NamedTuple(FooV2(fields)) == fields
@@ -278,7 +297,7 @@ fields = (a=1.0, b="b", c=3, d=[1,2,3])

@schema "example.param" Param

@version "example.param@1" begin
@version ParamV1 begin
a::Int
b::(<:Real)
c
@@ -296,7 +315,7 @@ end

@schema "example.child-param" ChildParam

@version "example.child-param@1 > example.param@1" begin
@version ChildParamV1 > ParamV1 begin
c::(<:Union{Real,String})
d::(<:Union{Real,Missing})
e
@@ -329,24 +348,24 @@ table_isequal(a, b) = isequal(Legolas.materialize(a), Legolas.materialize(b))
# key whose value is `Legolas.schema_identifier(schema)`. This field enables consumers of the table to
# perform automated (or manual) schema discovery/evolution/validation.
io = IOBuffer()
Legolas.write(io, table, BazSchemaV1())
Legolas.write(io, table, BazV1SchemaVersion())
t = Arrow.Table(seekstart(io))
@test Arrow.getmetadata(t) == Dict("legolas_schema_qualified" => "example.baz@1>example.bar@1")
@test table_isequal(t, Arrow.Table(Arrow.tobuffer(table)))
@test table_isequal(t, Arrow.Table(Legolas.tobuffer(table, BazSchemaV1()))) # `Legolas.tobuffer` is analogous to `Arrow.tobuffer`
@test table_isequal(t, Arrow.Table(Legolas.tobuffer(table, BazV1SchemaVersion()))) # `Legolas.tobuffer` is analogous to `Arrow.tobuffer`

# Similarly, Legolas provides `Legolas.read(src)`, which wraps `Arrow.Table(src)`, but
# validates the deserialized `Arrow.Table` against its declared schema version before
# returning it:
@test table_isequal(Legolas.read(Legolas.tobuffer(table, BazSchemaV1())), t)
@test table_isequal(Legolas.read(Legolas.tobuffer(table, BazV1SchemaVersion())), t)
msg = """
could not extract valid `Legolas.SchemaVersion` from the `Arrow.Table` read
via `Legolas.read`; is it missing the expected custom metadata and/or the
expected \"legolas_schema_qualified\" field?
"""
@test_throws ArgumentError(msg) Legolas.read(Arrow.tobuffer(table))
invalid = [Tables.rowmerge(row; k=string(row.k)) for row in table]
invalid_but_has_metadata = Arrow.tobuffer(invalid; metadata=("legolas_schema_qualified" => Legolas.identifier(BazSchemaV1()),))
invalid_but_has_metadata = Arrow.tobuffer(invalid; metadata=("legolas_schema_qualified" => Legolas.identifier(BazV1SchemaVersion()),))
@test_throws ArgumentError("field `k` has unexpected type; expected <:Int64, found String") Legolas.read(invalid_but_has_metadata)

# A note about one additional benefit of `Legolas.read`/`Legolas.write`: Unlike their Arrow.jl counterparts,
@@ -363,7 +382,7 @@ invalid_but_has_metadata = Arrow.tobuffer(invalid; metadata=("legolas_schema_qua

@schema "example.portable" Portable

@version "example.portable@1" begin
@version PortableV1 begin
id::UUID = UUID(id)
end

@@ -379,8 +398,8 @@ end
# since its UUID conversion behavior (and the corresponding type constraint) may be useful for validated construction.

# Luckily, it turns out that Legolas is actually smart enough to natively support this by default:
@test complies_with(Tables.Schema((:id,), (UUID,)), PortableSchemaV1())
@test complies_with(Tables.Schema((:id,), (UInt128,)), PortableSchemaV1())
@test complies_with(Tables.Schema((:id,), (UUID,)), PortableV1SchemaVersion())
@test complies_with(Tables.Schema((:id,), (UInt128,)), PortableV1SchemaVersion())

# How is this possible? Well, when Legolas checks whether a given field `f::T` matches a required field `f::F`, it doesn't
# directly check that `T <: F`; instead, it checks that `T <: Legolas.accepted_field_type(sv, F)` where `sv` is the relevant