-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Support default values in UpdateSchema #12211
API, Core: Support default values in UpdateSchema #12211
Conversation
private final boolean isOptional; | ||
private final String name; | ||
private boolean isOptional = true; | ||
private String name = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes enable the builder to handle all updates, even to whether the field is optional and the field name. That way the builder is always used to copy instead of needing to support specific fields when changing the name.
@@ -55,7 +56,7 @@ class SchemaUpdate implements UpdateSchema { | |||
private final Map<Integer, Integer> idToParent; | |||
private final List<Integer> deletes = Lists.newArrayList(); | |||
private final Map<Integer, Types.NestedField> updates = Maps.newHashMap(); | |||
private final Multimap<Integer, Types.NestedField> adds = | |||
private final Multimap<Integer, Integer> parentToAddedIds = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added fields are now tracked in updates
so that the added fields can be updated in the same set of schema changes. Now this multi-map tracks the fields added to a parent struct by ID rather than keeping a list of unchangeable fields.
} | ||
|
||
@Override | ||
public UpdateSchema addRequiredColumn(String name, Type type, String doc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved into the interface as a default implementation.
@@ -94,40 +95,24 @@ public SchemaUpdate allowIncompatibleChanges() { | |||
} | |||
|
|||
@Override | |||
public UpdateSchema addColumn(String name, Type type, String doc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved into the interface as a default implementation.
fieldId, | ||
Types.NestedField.of(fieldId, field.isOptional(), field.name(), newType, field.doc())); | ||
} | ||
Types.NestedField newField = Types.NestedField.from(field).ofType(newType).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, updates always use the builder to preserve existing metadata.
Collection<Types.NestedField> fieldsToAdd = adds.get(parentId); | ||
if (fieldsToAdd == null || fieldsToAdd.isEmpty()) { | ||
Collection<Types.NestedField> fieldsToAdd = | ||
adds.get(parentId).stream().map(updates::get).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A multi-map returns an empty list when the ID is not found.
api.addColumn(parentName, field.name(), field.type(), field.doc()); | ||
String fullName = (parentName != null ? parentName + "." : "") + field.name(); | ||
api.addColumn(parentName, field.name(), field.type(), field.doc(), field.initialDefault()) | ||
.updateColumnDefault(fullName, field.writeDefault()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To preserve both the initial and write defaults, this creates the field with the initial default (which sets both) and then updates the field, which sets only the write default because the initial default cannot change once it is set.
// incompatible changes. | ||
assertThatThrownBy(() -> sql("ALTER TABLE tl ADD (pk STRING NOT NULL)")) | ||
.hasRootCauseInstanceOf(IllegalArgumentException.class) | ||
.hasRootCauseMessage("Incompatible change: cannot add required column: pk"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to remove these assertions than to update them. These cases are covered by core tests for schema updates, which is the canonical place to test behavior. Flink and Spark tests should validate the integration (that the ADD COLUMN is passed) rather than the behavior (that NOT NULL is rejected).
* <p>Adding a required column without a default is an incompatible change that can break reading | ||
* older data. To make this a compatible change, add a default value by calling {@link |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a required column, without a default value, should affect writers. Because the field is missing the write operation will be rejected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will cause writes without the required column to fail (as would adding an optional column), but it is a compatible change because you always know that the data can be read correctly. The incompatibility is when the NOT NULL constraint is violated because existing data has no value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Now I need to do this on the Python side as well 👍
* @param name name for the new column | ||
* @param type type for the new column | ||
* @param doc documentation string for the new column | ||
* @return this for method chaining | ||
* @throws IllegalArgumentException If name contains "." | ||
*/ | ||
UpdateSchema addColumn(String name, Type type, String doc); | ||
default UpdateSchema addColumn(String name, Type type, String doc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we ever considered accepting NestedField
in methods that add columns and relying on its builder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I don't think it is a good idea. This API is user-facing and we don't want users to need to understand the distinction between initial default and write default. This API is supposed to match SQL capabilities and is higher level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (pending checks) Literal is a huge improvement over passing an opaque object around.
Merged. Thanks for reviewing, @Fokko, @danielcweeks, and @aokolnychyi! |
This adds default value support to
UpdateSchema
and the implementation of it in core.The API has new methods that accept a default value:
addColumn(String name, Type type, String doc, Object defaultValue)
addRequiredColumn(String name, Type type, String doc, Object defaultValue)
updateColumnDefault(String name, Object defaultValue)
updateColumn(String name, Type type, String doc, Object defaultValue)
The new methods always require supplying a doc string, which can be
null
. However, callers should not need to know to passnull
when there is no doc string. Because the type of a default value isObject
, adding additional method signatures that omit the doc string would conflict (for instanceaddColumn("a", StringType.get(), "default or description?")
).To make the API easier to use without adding conflicting signatures or new method names, this updates the implementation to allow combining
addRequiredColumn
withupdateColumnDefault
. Previously, modifications to added columns were not supported, but adding support makes the API behave like a builder:The API allows setting a field's default and does not expose the internal concepts of initial default and write default. When a column is added, its initial default can be set. Subsequent updates will only modify the write default. The only exception to this is that
updateColumnDefault
will change the initial default for a newly added required column if it is not set to enable the builder-like pattern.