
Core: Fix cleanup of orphaned statistics files in dropTableData #12132

Open · wants to merge 6 commits into base: main
Conversation

lliangyu-lin (Contributor)

Description

Currently, Iceberg dropTableData() does not properly delete statistics files (.stats) that are replaced by newer statistics files. When updateStatistics() is called multiple times on the same snapshot, the older statistics files are removed from metadata.json but remain in storage, leading to orphaned files.

As Issue #11876 points out, #9305 and #9409 remove only the latest Puffin file; old Puffin files still remain.
This PR attempts to address the issue by looking through all metadata files, so that every statistics and partition statistics file referenced in historical metadata is deleted.

An alternative solution would be to explicitly track old (replaced) statistics files in metadata files, but that would require changing the metadata spec.
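
To make the problem concrete, here is a minimal sketch of how the orphaning occurs (the table setup is assumed; paths, sizes, and the empty blob list are illustrative):

import org.apache.iceberg.GenericStatisticsFile;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;

class StatsOrphanExample {
  // 'table' is assumed to be a loaded Iceberg table with at least one snapshot.
  static void orphanOldStats(Table table) {
    long snapshotId = table.currentSnapshot().snapshotId();

    // First stats file: metadata.json now references stats-1.stats.
    StatisticsFile first = new GenericStatisticsFile(
        snapshotId, "/table/metadata/stats-1.stats", 100L, 10L, ImmutableList.of());
    table.updateStatistics().setStatistics(first).commit();

    // Second stats file for the same snapshot: metadata.json now references only
    // stats-2.stats, while stats-1.stats remains in storage. dropTableData() reads
    // only the latest metadata, so stats-1.stats is never deleted.
    StatisticsFile second = new GenericStatisticsFile(
        snapshotId, "/table/metadata/stats-2.stats", 100L, 10L, ImmutableList.of());
    table.updateStatistics().setStatistics(second).commit();
  }
}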

Testing

  • ./gradlew spotlessApply -DallModules passed
  • ./gradlew build -x test -x integrationTest passed
  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.hadoop.TestCatalogUtilDropTable" passed

Closes #11876

github-actions bot added the core label Jan 29, 2025
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
.containsAll(partitionStatsLocations);
}

@Test
@ebyhr (Contributor) commented Jan 30, 2025:

TestCatalogUtilDropTable exists under the hadoop package. Can we move it to CatalogTests or somewhere similar so we can test more catalogs?

lliangyu-lin (Contributor, Author):

Thanks for the review!
Regarding the comment, I'm not sure moving to CatalogTests is really necessary, since there isn't any catalog-specific logic in dropTableData.

Collaborator:

I might be wrong here, but I don't think all the catalog implementations use this function to purge files. I haven't found any usage for RESTCatalog, for instance. Adding this to CatalogTests would also result in running it for non-affected catalogs, IMO.

@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
.containsAll(partitionStatsLocations);
}

@Test
public void dropTableDataDeletesAllPuffinFiles() throws IOException {
@ebyhr (Contributor) commented Jan 30, 2025:

Puffin is used for deletion vectors in the V3 spec. It would be nice to rename the test to ...StatsFiles if it doesn't contain DVs.

lliangyu-lin (Contributor, Author):

Yes, I'll rename the test instead.

Collaborator:

Maybe dropTableDataDeletesStatsFromAllMetadataFiles ?

@gaborkaszab (Collaborator) left a comment:

Thanks for working on this, @lliangyu-lin!

I took a look and left some comments on the code itself. However, I'm a bit hesitant about the approach, TBH. The reason is that now, even for engines that don't use Iceberg's Puffin stats, we unconditionally add extra work to the dropTable() operation. I think iterating through and doing read IO for all the previous metadata.jsons could be a very expensive operation.

Frankly, I don't have an answer to this problem that I'd be comfortable with either :) All of the ideas I can think of have serious drawbacks too, unfortunately.

  • We could introduce a flag to trigger iterating the historical metadata.jsons, so that if an engine or user doesn't want to spend time dropping old stats files, they can turn this overhead off. With this we'd introduce yet another flag, though (a sketch of this follows the list).
  • Since it's updateStatistics and updatePartitionStatistics that simply forget about the existing stats files while updating the current ones, we could enhance these to also delete the no-longer-referenced ones. However, if we have some data/metadata corruption and decide to move back to an older metadata.json, we could end up referencing missing stats files. (I have to check whether ExpireSnapshots could cause the very same issue described here.)
  • We could keep track of the unreferenced stats file locations in the TableMetadata, so that when dropping the table we can use the latest one without needing to iterate the historical ones. However, this could accumulate forever, increasing the size of the table metadata unreasonably.
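
A minimal sketch of the first option, the flag; the property name here is hypothetical, not an existing Iceberg property:

// Hypothetical table property gating the historical scan (name made up for illustration).
boolean scanPreviousMetadata =
    PropertyUtil.propertyAsBoolean(
        metadata.properties(), "drop-table.scan-previous-metadata.enabled", false);

if (scanPreviousMetadata) {
  for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
    // read each historical metadata.json and collect its stats
    // and partition stats files for deletion
  }
}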

Probably there are other, more sophisticated answers to this problem than the ones I listed above. Let me ping @ajantha-bhat in case he has anything on his roadmap to take care of this issue.

CC @nastra, would you mind taking a look at my writeup? I might be overthinking this.

metadataToDelete.add(previousFile.file());

// Skip missing metadata files
if (!io.newInputFile(previousFile.file()).exists()) {
Collaborator:

If it doesn't exist, no need to add to metadataToDelete on L117. I'd move that line after this check.
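
A sketch of the suggested reordering:

// Only track the previous metadata file once we know it exists.
if (!io.newInputFile(previousFile.file()).exists()) {
  continue; // skip missing metadata files
}
metadataToDelete.add(previousFile.file());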

// Process the latest metadata file
metadataToDelete.add(metadata.metadataFileLocation());
if (io.newInputFile(metadata.metadataFileLocation()).exists()) {
TableMetadata latestMetadata = TableMetadataParser.read(io, metadata.metadataFileLocation());
Collaborator:

Is it needed to do an IO here to read the latestMetadata? We already have it loaded in metadata.

partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
}

// Process the latest metadata file
Collaborator:

I think instead of having dedicated code to take care of the latest metadata file, can't we simply initialize metadataToDelete (and the same for the stats collections) with these values, instead of having this code here? (The existence check and IO read probably aren't needed anyway; see my comment below.)
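
A sketch of that suggestion, seeding the collections from the already-loaded metadata so that no existence check or IO read is needed (collection names follow the PR; the exact types are assumed):

// 'metadata' is already in memory, so the latest metadata file needs no special handling.
Set<String> metadataToDelete = Sets.newHashSet(metadata.metadataFileLocation());
Set<StatisticsFile> statsToDelete = Sets.newHashSet(metadata.statisticsFiles());
Set<PartitionStatisticsFile> partitionStatsToDelete =
    Sets.newHashSet(metadata.partitionStatisticsFiles());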

continue;
}

TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
Collaborator:

I wonder if we need the existence check on L120 at all. We could just call this read() and handle the exception if the file doesn't exist. This could reduce the number of operations made on the storage.
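
A sketch of that idea; the exact exception thrown for a missing file depends on the FileIO implementation, so catching NotFoundException here is an assumption:

TableMetadata previousMetadata;
try {
  previousMetadata = TableMetadataParser.read(io, previousFile.file());
} catch (NotFoundException e) {
  // The metadata file is already gone; nothing to collect from it.
  continue;
}
statsToDelete.addAll(previousMetadata.statisticsFiles());
partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());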

@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

// Collect all metadata files and extract historical statistics files
for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
Collaborator:

We could be doing a lot of operations on the storage with this for loop in a sequential manner. We could use Tasks.foreach() to do it in parallel.
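
A sketch of that approach; the worker pool choice is illustrative, and the shared collections would need to be thread-safe (e.g. ConcurrentHashMap.newKeySet()):

Tasks.foreach(metadata.previousFiles())
    .executeWith(ThreadPools.getWorkerPool())
    .suppressFailureWhenFinished()
    .onFailure((entry, exc) -> LOG.warn("Failed to read metadata {}", entry.file(), exc))
    .run(
        previousFile -> {
          // Each previous metadata.json is read in parallel.
          TableMetadata previous = TableMetadataParser.read(io, previousFile.file());
          statsToDelete.addAll(previous.statisticsFiles());
          partitionStatsToDelete.addAll(previous.partitionStatisticsFiles());
        });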

@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

// Collect all metadata files and extract historical statistics files
Collaborator:

I found a ReachableFileUtil.metadataFileLocations() function that has a recursive mode, which not only iterates previousFiles() but also looks into those metadata files and their previousFiles() too. I wonder in what cases the recursive run would give different results compared to the non-recursive one, but I guess it was implemented for some reason :)
I think we should check whether this iteration is enough here, or whether we should do a recursive one too, so as not to miss any metadata locations.
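
For reference, the function in question; note it takes a Table, while dropTableData only has a TableMetadata and a FileIO, so using it here would need some adaptation:

// recursive = true also follows previousFiles() entries of older metadata files,
// catching locations the latest metadata no longer lists directly.
Set<String> allMetadataLocations = ReachableFileUtil.metadataFileLocations(table, true);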

table.io());
table
.updateStatistics()
.setStatistics(oldStatisticsFile.snapshotId(), oldStatisticsFile)
Collaborator:

This function is deprecated. The other one, without the snapshotId param, should be used. This goes for other occurrences below too.
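
The suggested change in the test, per the comment above (the snapshot id is carried by the StatisticsFile itself):

table
    .updateStatistics()
    .setStatistics(oldStatisticsFile) // non-deprecated overload, no snapshotId param
    .commit();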

@lliangyu-lin (Contributor, Author) commented Jan 30, 2025:

@gaborkaszab I totally agree with what you mentioned. I have my own doubts about the approach as well, and it's indeed very expensive, so I was hoping we could get some more eyes and opinions on this issue.

I have the exact same thoughts on the 3 approaches listed. So far, I'm leaning towards tracking unreferenced stats files in the latest metadata file. Although, it would be good to know how common it actually is to update the stats file for the same snapshot and create unreferenced stats files.

I think we might need to align on the approach before making any additional changes.

@gaborkaszab (Collaborator):

Another issue I found with the current approach is that if write.metadata.delete-after-commit.enabled is true, we keep only a certain number of metadata.jsons for a table, controlled by write.metadata.previous-versions-max. So even if we iterate through the previous metadata files, there could easily be ones that are no longer part of the previousFiles list.
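
For reference, the two table properties mentioned (both are real Iceberg properties; the values shown are illustrative):

write.metadata.delete-after-commit.enabled=true
write.metadata.previous-versions-max=5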

@lliangyu-lin But in general, since the fix for this issue is not straightforward, I recommend you start a [DISCUSS] thread on the dev@ mailing list. I've found that questions there usually get more visibility than on PRs.

Successfully merging this pull request may close these issues:

CatalogUtil:dropTableData method doesn't remove old Puffin files