
Core: Fix cleanup of orphaned statistics files in dropTableData #12132

Open · wants to merge 6 commits into base: main
Conversation

lliangyu-lin (Contributor)

Description

Currently, Iceberg dropTableData() does not properly delete statistics files (.stats) that are replaced by newer statistics files. When updateStatistics() is called multiple times on the same snapshot, the older statistics files are removed from metadata.json but remain in storage, leading to orphaned files.

As Issue #11876 points out, #9305 and #9409 remove only the latest Puffin file; old Puffin files still remain.
This PR attempts to address the issue by looking through all metadata files, so that every statistics and partition statistics file referenced in historical metadata is deleted.

An alternative solution would be to explicitly track old (replaced) statistics files in metadata files, but that would require changing the metadata spec.
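
To make the problem concrete, here is a minimal sketch of how the orphaning occurs (the table setup is assumed; paths, sizes, and the empty blob list are illustrative):

import org.apache.iceberg.GenericStatisticsFile;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;

class StatsOrphanExample {
  // 'table' is assumed to be a loaded Iceberg table with at least one snapshot.
  static void orphanOldStats(Table table) {
    long snapshotId = table.currentSnapshot().snapshotId();

    // First stats file: metadata.json now references stats-1.stats.
    StatisticsFile first = new GenericStatisticsFile(
        snapshotId, "/table/metadata/stats-1.stats", 100L, 10L, ImmutableList.of());
    table.updateStatistics().setStatistics(first).commit();

    // Second stats file for the same snapshot: metadata.json now references only
    // stats-2.stats, while stats-1.stats remains in storage. dropTableData() reads
    // only the latest metadata, so stats-1.stats is never deleted.
    StatisticsFile second = new GenericStatisticsFile(
        snapshotId, "/table/metadata/stats-2.stats", 100L, 10L, ImmutableList.of());
    table.updateStatistics().setStatistics(second).commit();
  }
}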

Testing

  • ./gradlew spotlessApply -DallModules passed
  • ./gradlew build -x test -x integrationTest passed
  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.hadoop.TestCatalogUtilDropTable" passed

Closes #11876

github-actions bot added the core label Jan 29, 2025
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
.containsAll(partitionStatsLocations);
}

@Test
@ebyhr (Contributor) commented Jan 30, 2025:

TestCatalogUtilDropTable exists under the hadoop package. Can we move it to CatalogTests or somewhere similar so we can test more catalogs?

lliangyu-lin (Contributor, Author):

Thanks for the review!
Regarding the comment, I'm not sure moving to CatalogTests is really necessary, since there isn't any catalog-specific logic in dropTableData.

Collaborator:

I might be wrong here, but I don't think all the catalog implementations use this function to purge files. I haven't found any usage for RESTCatalog, for instance. Adding this to CatalogTests would also result in running it for non-affected catalogs, IMO.

@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
.containsAll(partitionStatsLocations);
}

@Test
public void dropTableDataDeletesAllPuffinFiles() throws IOException {
@ebyhr (Contributor) commented Jan 30, 2025:

Puffin is used for deletion vectors in the V3 spec. It would be nice to rename the test to ...StatsFiles if it doesn't contain DVs.

lliangyu-lin (Contributor, Author):

Yes, I'll rename the test instead.

Collaborator:

Maybe dropTableDataDeletesStatsFromAllMetadataFiles ?

@gaborkaszab (Collaborator) left a comment:

Thanks for working on this, @lliangyu-lin!

I took a look and left some comments on the code itself. However, I'm a bit hesitant about the approach, TBH. The reason is that now, even for engines that don't use Iceberg's Puffin stats, we unconditionally add extra work to the dropTable() operation. I think iterating through and doing read IO for all the previous metadata.jsons could be a very expensive operation.

Frankly, I don't have an answer to this problem that I'd be comfortable with either :) All of the ideas I can think of have serious drawbacks too, unfortunately.

  • We could introduce a flag to trigger iterating the historical metadata.jsons, so that if an engine or user doesn't want to spend time dropping old stats files, they can turn this overhead off. With this we'd introduce yet another flag, though (a sketch of this follows the list).
  • Since it's updateStatistics and updatePartitionStatistics that simply forget about the existing stats files while updating the current ones, we could enhance these to also delete the no-longer-referenced ones. However, if we have some data/metadata corruption and decide to move back to an older metadata.json, we could end up referencing missing stats files. (I have to check whether ExpireSnapshots could cause the very same issue described here.)
  • We could keep track of the unreferenced stats file locations in the TableMetadata, so that when dropping the table we can use the latest one without needing to iterate the historical ones. However, this could accumulate forever, increasing the size of the table metadata unreasonably.
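
A minimal sketch of the first option, the flag; the property name here is hypothetical, not an existing Iceberg property:

// Hypothetical table property gating the historical scan (name made up for illustration).
boolean scanPreviousMetadata =
    PropertyUtil.propertyAsBoolean(
        metadata.properties(), "drop-table.scan-previous-metadata.enabled", false);

if (scanPreviousMetadata) {
  for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
    // read each historical metadata.json and collect its stats
    // and partition stats files for deletion
  }
}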

Probably there are other, more sophisticated answers to this problem than the ones I listed above. Let me ping @ajantha-bhat in case he has anything on his roadmap to take care of this issue.

CC @nastra, would you mind taking a look at my writeup? I might be overthinking this.

metadataToDelete.add(previousFile.file());

// Skip missing metadata files
if (!io.newInputFile(previousFile.file()).exists()) {
Collaborator:

If it doesn't exist, no need to add to metadataToDelete on L117. I'd move that line after this check.
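
A sketch of the suggested reordering:

// Only track the previous metadata file once we know it exists.
if (!io.newInputFile(previousFile.file()).exists()) {
  continue; // skip missing metadata files
}
metadataToDelete.add(previousFile.file());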

// Process the latest metadata file
metadataToDelete.add(metadata.metadataFileLocation());
if (io.newInputFile(metadata.metadataFileLocation()).exists()) {
TableMetadata latestMetadata = TableMetadataParser.read(io, metadata.metadataFileLocation());
Collaborator:

Is it needed to do an IO here to read the latestMetadata? We already have it loaded in metadata.

partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
}

// Process the latest metadata file
Collaborator:

I think instead of having dedicated code to take care of the latest metadata file, can't we simply initialize metadataToDelete (and the same for the stats collections) with these values, instead of having this code here? (The existence check and IO read probably aren't needed anyway; see my comment below.)
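
A sketch of that suggestion, seeding the collections from the already-loaded metadata so that no existence check or IO read is needed (collection names follow the PR; the exact types are assumed):

// 'metadata' is already in memory, so the latest metadata file needs no special handling.
Set<String> metadataToDelete = Sets.newHashSet(metadata.metadataFileLocation());
Set<StatisticsFile> statsToDelete = Sets.newHashSet(metadata.statisticsFiles());
Set<PartitionStatisticsFile> partitionStatsToDelete =
    Sets.newHashSet(metadata.partitionStatisticsFiles());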

continue;
}

TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
Collaborator:

I wonder if we need the existence check on L120 at all. We could just call this read() and handle the exception if the file doesn't exist. This could reduce the number of operations made on the storage.
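
A sketch of that idea; the exact exception thrown for a missing file depends on the FileIO implementation, so catching NotFoundException here is an assumption:

TableMetadata previousMetadata;
try {
  previousMetadata = TableMetadataParser.read(io, previousFile.file());
} catch (NotFoundException e) {
  // The metadata file is already gone; nothing to collect from it.
  continue;
}
statsToDelete.addAll(previousMetadata.statisticsFiles());
partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());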

@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

// Collect all metadata files and extract historical statistics files
for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
Collaborator:

We could be doing a lot of operations on the storage with this for loop in a sequential manner. We could use Tasks.foreach() to do it in parallel.
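
A sketch of that approach; the worker pool choice is illustrative, and the shared collections would need to be thread-safe (e.g. ConcurrentHashMap.newKeySet()):

Tasks.foreach(metadata.previousFiles())
    .executeWith(ThreadPools.getWorkerPool())
    .suppressFailureWhenFinished()
    .onFailure((entry, exc) -> LOG.warn("Failed to read metadata {}", entry.file(), exc))
    .run(
        previousFile -> {
          // Each previous metadata.json is read in parallel.
          TableMetadata previous = TableMetadataParser.read(io, previousFile.file());
          statsToDelete.addAll(previous.statisticsFiles());
          partitionStatsToDelete.addAll(previous.partitionStatisticsFiles());
        });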

@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

// Collect all metadata files and extract historical statistics files
Collaborator:

I found a ReachableFileUtil.metadataFileLocations() function that has a recursive mode, which not only iterates previousFiles() but also looks into those metadata files and their previousFiles() too. I wonder in what cases the recursive run would give different results compared to the non-recursive one, but I guess it was implemented for some reason :)
I think we should check whether this iteration is enough here, or whether we should do a recursive one too, so as not to miss any metadata locations.
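
For reference, the function in question; note it takes a Table, while dropTableData only has a TableMetadata and a FileIO, so using it here would need some adaptation:

// recursive = true also follows previousFiles() entries of older metadata files,
// catching locations the latest metadata no longer lists directly.
Set<String> allMetadataLocations = ReachableFileUtil.metadataFileLocations(table, true);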

table.io());
table
.updateStatistics()
.setStatistics(oldStatisticsFile.snapshotId(), oldStatisticsFile)
Collaborator:

This function is deprecated. The other one, without the snapshotId param, should be used. This goes for other occurrences below too.
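
The suggested change in the test, per the comment above (the snapshot id is carried by the StatisticsFile itself):

table
    .updateStatistics()
    .setStatistics(oldStatisticsFile) // non-deprecated overload, no snapshotId param
    .commit();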

@lliangyu-lin (Contributor, Author) commented Jan 30, 2025:

@gaborkaszab I totally agree with what you mentioned. I have my own doubts about the approach as well, and it's indeed very expensive, so I was hoping we could get some more eyes and opinions on this issue.

I have the exact same thoughts on the 3 approaches listed. So far, I'm leaning towards tracking unreferenced stats files in the latest metadata file. Although, it would be good to know how common it actually is to update the stats file for the same snapshot and create unreferenced stats files.

I think we might need to align on the approach before making any additional changes.

@gaborkaszab (Collaborator):

Another issue I found with the current approach is that if write.metadata.delete-after-commit.enabled is true, we keep only a certain number of metadata.jsons for a table, controlled by write.metadata.previous-versions-max. So even if we iterate through the previous metadata files, there could easily be ones that are no longer part of the previousFiles list.
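
For reference, the two table properties mentioned (both are real Iceberg properties; the values shown are illustrative):

write.metadata.delete-after-commit.enabled=true
write.metadata.previous-versions-max=5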

@lliangyu-lin But in general, since the fix for this issue is not straightforward, I recommend you start a [DISCUSS] thread on the dev@ mailing list. I've found that questions there usually get more visibility than on PRs.

Successfully merging this pull request may close these issues:

CatalogUtil:dropTableData method doesn't remove old Puffin files