
API, Core: Support default values in UpdateSchema #12211

Merged · 10 commits merged into apache:main on Feb 13, 2025

Conversation

@rdblue (Contributor, author) commented on Feb 9, 2025:

This adds default value support to UpdateSchema and the implementation of it in core.

The API has new methods that accept a default value:

  • addColumn(String name, Type type, String doc, Object defaultValue)
  • addRequiredColumn(String name, Type type, String doc, Object defaultValue)
  • updateColumnDefault(String name, Object defaultValue)
  • updateColumn(String name, Type type, String doc, Object defaultValue)

The new methods always require supplying a doc string, which can be null. However, callers should not have to pass null when there is no doc string. Because the type of a default value is Object, adding method signatures that omit the doc string would conflict with the existing ones (for instance, addColumn("a", StringType.get(), "default or description?")).
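A minimal sketch of the conflict (the Object overload is hypothetical and was not added): if both signatures existed, a String default would silently bind to the doc overload.

UpdateSchema addColumn(String name, Type type, String doc);
UpdateSchema addColumn(String name, Type type, Object defaultValue); // hypothetical; a String argument binds to the doc overload instead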

To make the API easier to use without adding conflicting signatures or new method names, this updates the implementation to allow combining addRequiredColumn with updateColumnDefault. Previously, modifications to added columns were not supported, but adding support makes the API behave like a builder:

table.updateSchema()
    .addRequiredColumn("a", StringType.get())
    .updateColumnDefault("a", "unknown")
    .commit();

The API allows setting a field's default and does not expose the internal concepts of initial default and write default. When a column is added, its initial default can be set; subsequent updates modify only the write default. The one exception, which enables the builder-like pattern above, is that updateColumnDefault will also set the initial default of a newly added required column when it has not yet been set.
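A hedged illustration of the distinction (the column name and values are made up): creating a column with a default sets both internal defaults, while a later update only moves the write default.

table.updateSchema()
    .addColumn("status", StringType.get(), null, "n/a") // initial and write default are both "n/a"
    .commit();

table.updateSchema()
    .updateColumnDefault("status", "unknown") // write default becomes "unknown"; initial default stays "n/a"
    .commit();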

-  private final boolean isOptional;
-  private final String name;
+  private boolean isOptional = true;
+  private String name = null;
rdblue (author):

These changes enable the builder to handle all updates, including whether the field is optional and the field name. That way, the builder is always used to copy a field, rather than special-casing individual properties when, for example, the name changes.
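A sketch of the resulting copy-on-update flow (builder method names other than from, ofType, and build are assumptions here):

Types.NestedField renamed =
    Types.NestedField.from(field)
        .withName(newName) // assumed setter; any property can be replaced on the copy
        .build();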

@@ -55,7 +56,7 @@ class SchemaUpdate implements UpdateSchema {
   private final Map<Integer, Integer> idToParent;
   private final List<Integer> deletes = Lists.newArrayList();
   private final Map<Integer, Types.NestedField> updates = Maps.newHashMap();
-  private final Multimap<Integer, Types.NestedField> adds =
+  private final Multimap<Integer, Integer> parentToAddedIds =
rdblue (author):

Added fields are now tracked in updates so that they can be modified within the same set of schema changes. The multimap now tracks the IDs of fields added to each parent struct, rather than keeping a list of immutable fields.
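A hedged sketch of the new bookkeeping (assignNewColumnId is a hypothetical helper; the map and multimap names come from the diff above):

int newId = assignNewColumnId(); // hypothetical helper
updates.put(newId, Types.NestedField.optional(newId, name, type, doc));
parentToAddedIds.put(parentId, newId);
// later changes (rename, default, etc.) simply replace updates.get(newId)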

 }

-  @Override
-  public UpdateSchema addRequiredColumn(String name, Type type, String doc) {
rdblue (author):

Moved into the interface as a default implementation.
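A plausible shape for the interface default, delegating to the new four-argument method (the exact body is an assumption):

default UpdateSchema addRequiredColumn(String name, Type type, String doc) {
  return addRequiredColumn(name, type, doc, null);
}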

@@ -94,40 +95,24 @@ public SchemaUpdate allowIncompatibleChanges() {
 }

-  @Override
-  public UpdateSchema addColumn(String name, Type type, String doc) {
rdblue (author):

Moved into the interface as a default implementation.

-        fieldId,
-        Types.NestedField.of(fieldId, field.isOptional(), field.name(), newType, field.doc()));
-  }
+    Types.NestedField newField = Types.NestedField.from(field).ofType(newType).build();
rdblue (author):

Now, updates always use the builder to preserve existing metadata.

-    Collection<Types.NestedField> fieldsToAdd = adds.get(parentId);
-    if (fieldsToAdd == null || fieldsToAdd.isEmpty()) {
+    Collection<Types.NestedField> fieldsToAdd =
+        adds.get(parentId).stream().map(updates::get).collect(Collectors.toList());
rdblue (author):

A multimap returns an empty collection when the ID is not found, so the null check is no longer needed.
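For reference, a minimal Guava example of that behavior:

Multimap<Integer, Integer> adds = HashMultimap.create();
Collection<Integer> added = adds.get(42); // empty collection, never null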

-    api.addColumn(parentName, field.name(), field.type(), field.doc());
+    String fullName = (parentName != null ? parentName + "." : "") + field.name();
+    api.addColumn(parentName, field.name(), field.type(), field.doc(), field.initialDefault())
+        .updateColumnDefault(fullName, field.writeDefault());
rdblue (author):

To preserve both the initial and write defaults, this creates the field with the initial default (which sets both) and then updates the field, which sets only the write default because the initial default cannot change once it is set.
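A hedged worked trace with made-up values: suppose the source field has initial default "n/a" and write default "unknown".

api.addColumn("status", StringType.get(), null, "n/a") // the copy starts with both defaults set to "n/a"
    .updateColumnDefault("status", "unknown"); // only the write default moves; the initial default stays "n/a"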

@github-actions bot added the flink label on Feb 11, 2025
-    // incompatible changes.
-    assertThatThrownBy(() -> sql("ALTER TABLE tl ADD (pk STRING NOT NULL)"))
-        .hasRootCauseInstanceOf(IllegalArgumentException.class)
-        .hasRootCauseMessage("Incompatible change: cannot add required column: pk");
rdblue (author):

I think it's better to remove these assertions than to update them. These cases are covered by core tests for schema updates, which is the canonical place to test behavior. Flink and Spark tests should validate the integration (that the ADD COLUMN is passed) rather than the behavior (that NOT NULL is rejected).

Comment on lines +200 to +201:
+   * <p>Adding a required column without a default is an incompatible change that can break reading
+   * older data. To make this a compatible change, add a default value by calling {@link
A reviewer (Contributor):

Adding a required column without a default value should affect writers: because the field is missing, the write operation will be rejected.

rdblue (author):

It will cause writes without the required column to fail (as would adding an optional column), but it is a compatible change because you always know that the data can be read correctly. The incompatibility is when the NOT NULL constraint is violated because existing data has no value.
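For example (hedged; the column name and default are made up), adding a required column with a default is compatible and does not require allowIncompatibleChanges():

table.updateSchema()
    .addRequiredColumn("region", StringType.get(), null, "unknown") // old data reads the default "unknown"
    .commit();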

@Fokko (Contributor) left a comment:

LGTM

Now I need to do this on the Python side as well 👍

   * @param name name for the new column
   * @param type type for the new column
   * @param doc documentation string for the new column
   * @return this for method chaining
   * @throws IllegalArgumentException If name contains "."
   */
-  UpdateSchema addColumn(String name, Type type, String doc);
+  default UpdateSchema addColumn(String name, Type type, String doc) {
@aokolnychyi (Contributor) commented on Feb 12, 2025:

Have we ever considered accepting NestedField in methods that add columns and relying on its builder?

rdblue (author):

Yes, but I don't think it is a good idea. This API is user-facing and we don't want users to need to understand the distinction between initial default and write default. This API is supposed to match SQL capabilities and is higher level.

@danielcweeks (Contributor) left a comment:

+1 (pending checks) Literal is a huge improvement over passing an opaque object around.

@rdblue merged commit 602c35a into apache:main on Feb 13, 2025 · 47 checks passed
@rdblue (author) commented on Feb 13, 2025:

Merged. Thanks for reviewing, @Fokko, @danielcweeks, and @aokolnychyi!
