
API, Core: Support default values in UpdateSchema #12211

Merged · 10 commits merged into apache:main on Feb 13, 2025

Conversation

@rdblue (Contributor, author) commented on Feb 9, 2025:

This adds default value support to UpdateSchema and the implementation of it in core.

The API has new methods that accept a default value:

  • addColumn(String name, Type type, String doc, Object defaultValue)
  • addRequiredColumn(String name, Type type, String doc, Object defaultValue)
  • updateColumnDefault(String name, Object defaultValue)
  • updateColumn(String name, Type type, String doc, Object defaultValue)

The new methods always require supplying a doc string, which can be null. However, callers should not have to pass null when there is no doc string. Because the type of a default value is Object, adding method signatures that omit the doc string would conflict with the existing ones (for instance, addColumn("a", StringType.get(), "default or description?")).
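A minimal sketch of the conflict (the Object overload is hypothetical and was not added): if both signatures existed, a String default would silently bind to the doc overload.

UpdateSchema addColumn(String name, Type type, String doc);
UpdateSchema addColumn(String name, Type type, Object defaultValue); // hypothetical; a String argument binds to the doc overload instead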

To make the API easier to use without adding conflicting signatures or new method names, this updates the implementation to allow combining addRequiredColumn with updateColumnDefault. Previously, modifications to added columns were not supported, but adding support makes the API behave like a builder:

table.updateSchema()
    .addRequiredColumn("a", StringType.get())
    .updateColumnDefault("a", "unknown")
    .commit();

The API allows setting a field's default and does not expose the internal concepts of initial default and write default. When a column is added, its initial default can be set; subsequent updates modify only the write default. The one exception, which enables the builder-like pattern above, is that updateColumnDefault will also set the initial default of a newly added required column when it has not yet been set.
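A hedged illustration of the distinction (the column name and values are made up): creating a column with a default sets both internal defaults, while a later update only moves the write default.

table.updateSchema()
    .addColumn("status", StringType.get(), null, "n/a") // initial and write default are both "n/a"
    .commit();

table.updateSchema()
    .updateColumnDefault("status", "unknown") // write default becomes "unknown"; initial default stays "n/a"
    .commit();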

-  private final boolean isOptional;
-  private final String name;
+  private boolean isOptional = true;
+  private String name = null;
rdblue (author):

These changes enable the builder to handle all updates, including whether the field is optional and the field name. That way, the builder is always used to copy a field, rather than special-casing individual properties when, for example, the name changes.
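A sketch of the resulting copy-on-update flow (builder method names other than from, ofType, and build are assumptions here):

Types.NestedField renamed =
    Types.NestedField.from(field)
        .withName(newName) // assumed setter; any property can be replaced on the copy
        .build();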

@@ -55,7 +56,7 @@ class SchemaUpdate implements UpdateSchema {
   private final Map<Integer, Integer> idToParent;
   private final List<Integer> deletes = Lists.newArrayList();
   private final Map<Integer, Types.NestedField> updates = Maps.newHashMap();
-  private final Multimap<Integer, Types.NestedField> adds =
+  private final Multimap<Integer, Integer> parentToAddedIds =
rdblue (author):

Added fields are now tracked in updates so that they can be modified within the same set of schema changes. The multimap now tracks the IDs of fields added to each parent struct, rather than keeping a list of immutable fields.
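A hedged sketch of the new bookkeeping (assignNewColumnId is a hypothetical helper; the map and multimap names come from the diff above):

int newId = assignNewColumnId(); // hypothetical helper
updates.put(newId, Types.NestedField.optional(newId, name, type, doc));
parentToAddedIds.put(parentId, newId);
// later changes (rename, default, etc.) simply replace updates.get(newId)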

 }

-  @Override
-  public UpdateSchema addRequiredColumn(String name, Type type, String doc) {
rdblue (author):

Moved into the interface as a default implementation.
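A plausible shape for the interface default, delegating to the new four-argument method (the exact body is an assumption):

default UpdateSchema addRequiredColumn(String name, Type type, String doc) {
  return addRequiredColumn(name, type, doc, null);
}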

@@ -94,40 +95,24 @@ public SchemaUpdate allowIncompatibleChanges() {
 }

-  @Override
-  public UpdateSchema addColumn(String name, Type type, String doc) {
rdblue (author):

Moved into the interface as a default implementation.

-        fieldId,
-        Types.NestedField.of(fieldId, field.isOptional(), field.name(), newType, field.doc()));
-  }
+    Types.NestedField newField = Types.NestedField.from(field).ofType(newType).build();
rdblue (author):

Now, updates always use the builder to preserve existing metadata.

-    Collection<Types.NestedField> fieldsToAdd = adds.get(parentId);
-    if (fieldsToAdd == null || fieldsToAdd.isEmpty()) {
+    Collection<Types.NestedField> fieldsToAdd =
+        adds.get(parentId).stream().map(updates::get).collect(Collectors.toList());
rdblue (author):

A multimap returns an empty collection when the ID is not found, so the null check is no longer needed.
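For reference, a minimal Guava example of that behavior:

Multimap<Integer, Integer> adds = HashMultimap.create();
Collection<Integer> added = adds.get(42); // empty collection, never null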

-    api.addColumn(parentName, field.name(), field.type(), field.doc());
+    String fullName = (parentName != null ? parentName + "." : "") + field.name();
+    api.addColumn(parentName, field.name(), field.type(), field.doc(), field.initialDefault())
+        .updateColumnDefault(fullName, field.writeDefault());
rdblue (author):

To preserve both the initial and write defaults, this creates the field with the initial default (which sets both) and then updates the field, which sets only the write default because the initial default cannot change once it is set.
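A hedged worked trace with made-up values: suppose the source field has initial default "n/a" and write default "unknown".

api.addColumn("status", StringType.get(), null, "n/a") // the copy starts with both defaults set to "n/a"
    .updateColumnDefault("status", "unknown"); // only the write default moves; the initial default stays "n/a"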

@github-actions bot added the flink label on Feb 11, 2025
-    // incompatible changes.
-    assertThatThrownBy(() -> sql("ALTER TABLE tl ADD (pk STRING NOT NULL)"))
-        .hasRootCauseInstanceOf(IllegalArgumentException.class)
-        .hasRootCauseMessage("Incompatible change: cannot add required column: pk");
rdblue (author):

I think it's better to remove these assertions than to update them. These cases are covered by core tests for schema updates, which is the canonical place to test behavior. Flink and Spark tests should validate the integration (that the ADD COLUMN is passed) rather than the behavior (that NOT NULL is rejected).

Comment on lines +200 to +201:
+   * <p>Adding a required column without a default is an incompatible change that can break reading
+   * older data. To make this a compatible change, add a default value by calling {@link
A reviewer (Contributor):

Adding a required column without a default value should affect writers: because the field is missing, the write operation will be rejected.

rdblue (author):

It will cause writes without the required column to fail (as would adding an optional column), but it is a compatible change because you always know that the data can be read correctly. The incompatibility is when the NOT NULL constraint is violated because existing data has no value.
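For example (hedged; the column name and default are made up), adding a required column with a default is compatible and does not require allowIncompatibleChanges():

table.updateSchema()
    .addRequiredColumn("region", StringType.get(), null, "unknown") // old data reads the default "unknown"
    .commit();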

@Fokko (Contributor) left a comment:

LGTM

Now I need to do this on the Python side as well 👍

   * @param name name for the new column
   * @param type type for the new column
   * @param doc documentation string for the new column
   * @return this for method chaining
   * @throws IllegalArgumentException If name contains "."
   */
-  UpdateSchema addColumn(String name, Type type, String doc);
+  default UpdateSchema addColumn(String name, Type type, String doc) {
@aokolnychyi (Contributor) commented on Feb 12, 2025:

Have we ever considered accepting NestedField in methods that add columns and relying on its builder?

rdblue (author):

Yes, but I don't think it is a good idea. This API is user-facing and we don't want users to need to understand the distinction between initial default and write default. This API is supposed to match SQL capabilities and is higher level.

@danielcweeks (Contributor) left a comment:

+1 (pending checks) Literal is a huge improvement over passing an opaque object around.

@rdblue merged commit 602c35a into apache:main on Feb 13, 2025 · 47 checks passed
@rdblue (author) commented on Feb 13, 2025:

Merged. Thanks for reviewing, @Fokko, @danielcweeks, and @aokolnychyi!
