Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: add variant type support #11831

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Dec 19, 2024

This is to add some required changes in API and core module for Variant support, including:

  • Add isVariantType() method for variant type
  • Add variant support in schema visitors such as AssignIdFreshIds, ReassignIds, PruneColumns, etc. and TypeUtil.
  • Add test coverage to test out schema visitors, TypeUtil.

Part of: #10392

@aihuaxu aihuaxu marked this pull request as ready for review December 19, 2024 23:00
@aihuaxu
Copy link
Contributor Author

aihuaxu commented Dec 19, 2024

cc @rdblue, @RussellSpitzer, @flyrain and @JonasJ-ap. This is to add the changes in core to support variant type.

@aihuaxu aihuaxu requested a review from rdblue December 21, 2024 04:45
@aihuaxu aihuaxu force-pushed the variant-type-core branch 2 times, most recently from b276d3f to fe6038a Compare December 21, 2024 16:42
@shohamyamin
Copy link

@aihuaxu very important feature that will allow a lot more options for iceberg, thank you for your contribution

@aihuaxu aihuaxu force-pushed the variant-type-core branch 3 times, most recently from e0a430f to 1b56bf5 Compare January 30, 2025 05:12
@aihuaxu aihuaxu requested a review from rdblue February 10, 2025 17:36
new Schema(required(1, "v", testType), required(2, "A", Types.IntegerType.get()));

Schema assignedSchema =
TypeUtil.assignFreshIds(sourceSchema, new AtomicInteger(10)::incrementAndGet);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't use sourceSchema as intended. ID assignment with a source schema uses assigned IDs from that schema when names match and assigns missing/new IDs using the NextID supplier. When using the single schema version, the second schema should not be present.

That also raises the question about what we need to test for new types. In this case, because the type just needs to be handled by field and struct, I think it's fine to just have a case that validates that UnsupportedOperationException is not thrown and that the behavior is expected. That is, we only need on test with a variant here. Other tests with more logic for primitives may need more cases though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I just need one schema. I may not get the following: seems we can test for unknown and other primitive types here as well to make sure the correct type is returned, not null.

we only need on test with a variant here. Other tests with more logic for primitives may need more cases though.

@aihuaxu aihuaxu requested a review from rdblue February 13, 2025 02:33
@@ -187,6 +187,18 @@ public Schema map(Types.MapType map, Schema keySchema, Schema valueSchema) {
return mapSchema;
}

@Override
public Schema variant(Types.VariantType variant) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is included (and tested) in #12238. Can you remove it from here?

@@ -712,6 +712,10 @@ public T map(Types.MapType map, Supplier<T> keyResult, Supplier<T> valueResult)
return null;
}

public T variant(Types.VariantType variant) {
throw new UnsupportedOperationException("Unsupported type: variant");
Copy link
Contributor

@rdblue rdblue Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not yet implemented by:

  • CheckCompatibility (tested by TestReadabilityChecks)
  • FixupTypes, SparkFixupTypes, or FlinkFixupTypes
  • PruneColumnsWithoutReordering (Spark, so maybe this can be done separately)
  • ReassignDoc
  • AssignIds

It is also not implemented by PruneColumnsWithReordering but that doesn't appear to be referenced anywhere. I opened #12258 to remove these.

public T variant() {
return null;
public T variant(Types.VariantType variant) {
throw new UnsupportedOperationException("Unsupported type: variant");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still need to be implemented by:

  • CreateMapping
  • DescribeSchemaVisitor
  • IdToOrcName and other conversions

I think it's fine to implement the conversions later as part of Avro, Parquet, and ORC work, but the common ones should probably be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants