Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8817] fix backward compatibility issue #12615

Merged
merged 1 commit into from
Jan 27, 2025

Conversation

Davis-Zhang-Onehouse
Copy link
Contributor

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse commented Jan 10, 2025

Change Logs

There are 3 configs at interplay:

table config hoodie.table.version - can be 6 or 8
writer config hoodie.write.table.version - can be 6 or 8
writer config hoodie.write.auto.upgrade - boolean flag to control transition during upgrade, avoiding unintended upgrades.
hoodie.table.version reflects the actual state of the table.
hoodie.write.table.version defines the intended state of data written by the writer.

During upgrade, the initial phase for writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6 until when users are ready to start writing in the new format (i.e. after all readers have been upgraded).

For creating new tables using 1.0, by default i.e. when none of the writer configs are set, hoodie.table.version should be 8. hoodie.write.table.version is automatically inferred to be 8.

With my change that explicitly set the write version, when user does not override the config, the version would be 8 automatically.

For creating new tables using 1.0, if there is a need to create one with older format, then writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6. In this case, hoodie.table.version should also be 6.

It means in hudi 1.0, if a user wants to create version 6 table, they would set the config above and the system should honor that configuration in metaClient.initTable. This is why this change enforces when metaClient is created, we set the table version to write version. This is what is missing without the change

when we specify write table version as 6

CREATE TABLE hudi_14_table_sql_005 (
    event_id INT,
    event_date STRING,
    event_name STRING,
    event_ts STRING,
    event_type STRING
) USING hudi
 OPTIONS(
    type = 'cow', -- or 'mor'
    primaryKey = 'event_id,event_date',
    preCombileField = 'event_ts',
    hoodie.write.table.version = 6
)
PARTITIONED BY (event_type)
LOCATION 's3://<some_bucket>/hudi_14_table_sql_005';

the hoodie.properties said the version is 8.

It is because when building the meta client table version is not set according to the write table version. Found similar misses else where, fixed as well with test coverage.

Impact

hoodie.write.table.version will also contribute to hoodie.table.version.

Risk level (write none, low medium or high below)

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jan 10, 2025
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Jan 10, 2025
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8817 branch 2 times, most recently from a64c849 to 944075b Compare January 10, 2025 19:02
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Jan 10, 2025
yihua
yihua previously requested changes Jan 11, 2025
@@ -1271,10 +1272,24 @@ public TableBuilder fromProperties(Properties properties) {
setTableName(hoodieConfig.getString(HoodieTableConfig.NAME));
}

if (hoodieConfig.contains(VERSION)) {
if (hoodieConfig.contains(VERSION) && hoodieConfig.contains(WRITE_TABLE_VERSION)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in meta client should not consider write table version. Otherwise, there is no point checking whether hoodie.write.table.version and hoodie.table.version match.

CommonClientUtils#validateTableVersion has the validation. Does Spark SQL writer escape that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, SQL writer does not call this function at all for create table stmt

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse marked this pull request as ready for review January 17, 2025 23:40
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 : can you ensure we are covering all meta client instantiation flows in flink

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what I know, we can;t let users directly directly dictate the table version.
based on our migration guide, they can only configure "hoodie.write.auto.upgrade"
ref: https://hudi.apache.org/docs/deployment#upgrading-to-100

@yihua @codope : can you clarify this.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking on clarification

@Davis-Zhang-Onehouse
Copy link
Contributor Author

blocking on clarification

we discussed offline, let me know any AIs I should follow or please resume the review. Thanks

@nsivabalan
Copy link
Contributor

unblocking for now.
goal for this patch (open to splitting it up into multiple):

  • for new table creation, we should default to table version 8.
  • for new table creation, if user overrides "hoodie.write.table.version", we will honor the overridden value.

We need to validate and may be fix all writers (spark-ds, spark-sql, deltastreamer and spark structured streaming ingestion)

@nsivabalan
Copy link
Contributor

Deltastreamer :

HoodieTableMetaClient initializeEmptyTable(HoodieTableMetaClient.TableBuilder tableBuilder, String partitionColumns,

spark-sql and spark-ds: guess this patch already have tests.

spark-streaming:

def testStructuredStreaming(tableType: HoodieTableType): Unit = {

@@ -302,6 +303,7 @@ class HoodieSparkSqlWriterInternal {
else KeyGeneratorType.getKeyGeneratorClassName(hoodieConfig)
HoodieTableMetaClient.newTableBuilder()
.setTableType(tableType)
.setTableVersion(Integer.valueOf(getStringWithAltKeys(parameters, HoodieWriteConfig.WRITE_TABLE_VERSION)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't table version and write table version two different config options, not sure why setting up the table version always from the write table version, the write table version is auxiliary for migration, I kind of think we should not set it up explicitly every time we initiate the table. can we detect the table version automically from the table config
and validate against the write table version then?

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to clarify the coordination of table version of write table version, confusing to me.

@Davis-Zhang-Onehouse
Copy link
Contributor Author

Deltastreamer :

HoodieTableMetaClient initializeEmptyTable(HoodieTableMetaClient.TableBuilder tableBuilder, String partitionColumns,

spark-sql and spark-ds: guess this patch already have tests.

spark-streaming:

def testStructuredStreaming(tableType: HoodieTableType): Unit = {

created jira 1.0.2 https://issues.apache.org/jira/browse/HUDI-8914

@codope
Copy link
Member

codope commented Jan 25, 2025

There are 3 configs at interplay:

  • table config hoodie.table.version - can be 6 or 8
  • writer config hoodie.write.table.version - can be 6 or 8
  • writer config hoodie.write.auto.upgrade - boolean flag to control transition during upgrade, avoiding unintended upgrades.

hoodie.table.version reflects the actual state of the table.
hoodie.write.table.version defines the intended state of data written by the writer.

During upgrade, the initial phase for writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6 until when users are ready to start writing in the new format (i.e. after all readers have been upgraded).

For creating new tables using 1.0, by default i.e. when none of the writer configs are set, hoodie.table.version should be 8. hoodie.write.table.version is automatically inferred to be 8.

For creating new tables using 1.0, if there is a need to create one with older format, then writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6. In this case, hoodie.table.version should also be 6.

@danny0405
Copy link
Contributor

For creating new tables using 1.0, by default i.e. when none of the writer configs are set, hoodie.table.version should be 8. hoodie.write.table.version is automatically inferred to be 8.

For creating new tables using 1.0, if there is a need to create one with older format, then writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6. In this case, hoodie.table.version should also be 6.

So it looks like there is even no need to explicitly set up the table version when creating new table with 1.0 code, just hard code it as 8 should be fine.

@codope
Copy link
Member

codope commented Jan 25, 2025

So it looks like there is even no need to explicitly set up the table version when creating new table with 1.0 code, just hard code it as 8 should be fine.

By default, we don't need to. Only when users want to write in older format with 1.0, they need to explicitly set the writer table version and auto upgrade

@Davis-Zhang-Onehouse
Copy link
Contributor Author

@danny0405

For creating new tables using 1.0, if there is a need to create one with older format, then writers should be configured with hoodie.write.auto.upgrade=false and hoodie.write.table.version=6. In this case, hoodie.table.version should also be 6.

It means in hudi 1.0, if a user wants to create version 6 table, they would set the config above and the system should honor that configuration in metaClient.initTable. This is why this change enforces when metaClient is created, we set the table version to write version. This is what is missing without the change

For creating new tables using 1.0, by default i.e. when none of the writer configs are set, hoodie.table.version should be 8. hoodie.write.table.version is automatically inferred to be 8.

With my change that explicitly set the write version, when user does not override the config, the version would be 8 automatically.
For sake of unblocking the change soon by 101 cut off which is today
@nsivabalan / @codope / @yihua does it make sense that we merge the PR first. If later Danny has any follow ups we can discuss tonight. Based on all the discussion so far I only see minor communication gap and we are heading in the right direction.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are aligned here, I am going ahead for 1.0.1. Config clean up for (write table version and tbl version) and doc updates we can take it up as a follow up.

@danny0405 @yihua : if you have any follow ups that needs to go into 1.0.1, do speak up :)

@nsivabalan nsivabalan dismissed stale reviews from danny0405 and yihua January 27, 2025 19:56

Going ahead with the changes. We will have follow ups for any ask from danny

@nsivabalan nsivabalan merged commit bc5566d into apache:master Jan 27, 2025
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:M PR with lines of changes in (100, 300]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants