Add UUID conversion to and from 16 byte fixed sequences #1

Merged
merged 4 commits into from
Feb 21, 2025


@urmastalimaa commented Feb 12, 2025


UUIDs are often passed around in application code in their canonical
hex-string representation, e.g. "550e8400-e29b-41d4-a716-446655440000".
Encoding a UUID as an Avro "string" takes 37 bytes (a 1-byte length
prefix plus 36 characters), while its binary form fits into a 16-byte
"fixed", saving 21 bytes per encoding.
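
The byte counts above can be checked with a short sketch. It is written in Python purely for illustration and is not part of this library; the helper names `zigzag_varint` and `avro_string` are hypothetical. It assumes the standard Avro binary encoding, where a string is a zigzag varint length followed by the UTF-8 bytes.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def avro_string(s: str) -> bytes:
    """Encode a string the way Avro does: length prefix, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


canonical = "550e8400-e29b-41d4-a716-446655440000"

as_string = avro_string(canonical)                    # length prefix + 36 characters
as_fixed = bytes.fromhex(canonical.replace("-", ""))  # raw 128-bit value

print(len(as_string), len(as_fixed), len(as_string) - len(as_fixed))
```

The length 36 zigzag-encodes to 72, which fits in a single varint byte, so the string form is 1 + 36 bytes.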

This change lets application code keep passing around canonical hex
UUIDs while using the compact encoding; the only requirement is that
uuid_format: :canonical_string is given in the decode options.

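The Java reference implementation
(https://github.com/apache/avro/blob/230414abbb68e63e68f3b55bfc0cbca94f2737f6/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java#L291-L309)
also supports encoding UUIDs as both strings and 16-byte fixed sequences.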

  • When encoding, a 16-byte fixed schema with
    %{"logicalType" => "uuid"} converts a hex-string UUID to its 16-byte
    binary representation.

  • When decoding, given uuid_format: :canonical_string
    in the decode options, the binary representation is converted to the
    canonical hex-string representation.
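
The two conversions described in the list above amount to the following, shown here as a minimal Python sketch rather than the library's actual Elixir code. The function names `encode_uuid_fixed` and `decode_uuid_fixed` are hypothetical, and the `uuid_format` argument only mimics the decode option named in this PR.

```python
import uuid


def encode_uuid_fixed(value: str) -> bytes:
    """Convert a hex-string UUID into its 16-byte binary form."""
    # uuid.UUID also accepts a few non-canonical spellings (e.g. without hyphens);
    # whether the real encoder is that lenient is not stated in this PR.
    return uuid.UUID(value).bytes


def decode_uuid_fixed(raw: bytes, uuid_format: str = "binary"):
    """Return the raw bytes by default, or the canonical string when requested."""
    if len(raw) != 16:
        raise ValueError("a UUID fixed must be exactly 16 bytes")
    if uuid_format == "canonical_string":
        return str(uuid.UUID(bytes=raw))
    return raw  # default mirrors uuid_format: :binary


canonical = "550e8400-e29b-41d4-a716-446655440000"
raw = encode_uuid_fixed(canonical)

print(decode_uuid_fixed(raw))                      # 16 raw bytes, unchanged
print(decode_uuid_fixed(raw, "canonical_string"))  # canonical hex string
```

Keeping the binary form as the default is what leaves existing decode callers unaffected: they only see strings if they opt in.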

The encoding change is nearly backwards-compatible: previously, when
given an incorrectly sized "fixed" with {"logicalType": "uuid"}, an
error was raised, whereas now conversion is attempted.

The decoding change is fully backwards-compatible, as uuid_format
defaults to :binary.

For the UUID codec, the uniq library was added as a dependency (it has
no transitive dependencies).

davydog187 and others added 3 commits July 31, 2024 11:11

@indrekj left a comment


This makes sense to me. cc @laurglia, you were asking about the same thing some time ago.

@urmastalimaa force-pushed the fixed_uuid branch 3 times, most recently from 5deff6f to e43f62b, February 21, 2025 12:04
setup-beam does not allow ubuntu-24
@urmastalimaa merged commit 1839f9d into master Feb 21, 2025
2 checks passed