Core: Fix cleanup of orphaned statistics files in dropTableData #12132
Conversation
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
        .containsAll(partitionStatsLocations);
  }

  @Test
TestCatalogUtilDropTable exists under the hadoop package. Can we move it to CatalogTests or somewhere similar so we can test more catalogs?
Thanks for the review!
Regarding the comment, I'm not sure if moving to CatalogTests is really necessary, since there isn't any catalog-specific logic in dropTableData.
I might be wrong here, but I don't think all the catalog implementations use this function to purge files. I haven't found any usage for RESTCatalog, for instance. Adding this to CatalogTests would also result in running it for non-affected catalogs IMO.
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
        .containsAll(partitionStatsLocations);
  }

  @Test
  public void dropTableDataDeletesAllPuffinFiles() throws IOException {
Puffin is used for deletion vectors in the V3 spec. It would be nice to rename the test to ...StatsFiles if it doesn't contain DVs.
Yes, I'll rename the test instead.
Maybe dropTableDataDeletesStatsFromAllMetadataFiles?
Thanks for working on this, @lliangyu-lin!
I took a look and left some comments on the code itself. However, I'm a bit hesitant about the approach TBH. The reason is that now, even for engines that don't use Iceberg's Puffin stats, we unconditionally add extra work to the dropTable() operation. I think iterating through and doing read IO for all the previous metadata.jsons could be a very expensive operation.
Frankly, I don't have an answer to this problem I'd be comfortable with either :) Every idea I can think of has serious drawbacks too, unfortunately.
- We could introduce a flag to trigger iterating the historical metadata.jsons, so in case an engine or user doesn't want to spend time on dropping old stats files, they can turn this overhead off. With this we'd introduce yet another flag, though.
- Since it's updateStatistics and updatePartitionStatistics that simply forget about the existing stats files while updating the current ones, we could enhance them to also delete the no-longer-referenced ones. However, in case we have some data/metadata corruption and decide to move back to an older metadata.json, we could end up referencing missing stats files. (I have to check if ExpireSnapshots could cause the very same issue I described here.)
- We could keep track of the unreferenced stats file locations in the TableMetadata, so that when dropping the table we can use the latest one without needing to iterate the historical ones. However, this could accumulate forever, increasing the size of the table metadata unreasonably.
Probably there are some other, more sophisticated answers to this problem than the ones I listed above. Let me ping @ajantha-bhat in case he has anything on his roadmap to take care of this issue.
CC @nastra, would you mind taking a look at my writeup? I might be overthinking this.
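As a rough illustration of the first option in the list above (purely a sketch; the flag name and the extra overload are hypothetical, not part of this PR):

```java
// Hypothetical sketch of option 1: an opt-in flag so callers that don't use
// Puffin stats can skip walking the historical metadata.json files.
public static void dropTableData(FileIO io, TableMetadata metadata, boolean cleanHistoricalStats) {
  // ... existing manifest, data file, and current-stats cleanup ...
  if (cleanHistoricalStats) {
    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
      // read previousFile.file() and collect its statistics and
      // partition statistics file locations for deletion
    }
  }
}
```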
        metadataToDelete.add(previousFile.file());

        // Skip missing metadata files
        if (!io.newInputFile(previousFile.file()).exists()) {
If it doesn't exist, there's no need to add it to metadataToDelete on L117. I'd move that line after this check.
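A minimal sketch of the suggested reordering, using the names from the diff above:

```java
// Check existence first; only record the file for deletion if it's actually there.
if (!io.newInputFile(previousFile.file()).exists()) {
  continue; // skip missing metadata files entirely
}
metadataToDelete.add(previousFile.file());
```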
      // Process the latest metadata file
      metadataToDelete.add(metadata.metadataFileLocation());
      if (io.newInputFile(metadata.metadataFileLocation()).exists()) {
        TableMetadata latestMetadata = TableMetadataParser.read(io, metadata.metadataFileLocation());
Is it needed to do IO here to read the latestMetadata? We already have it loaded in metadata.
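A sketch of what that could look like (statsToDelete is an assumed sibling collection of the partitionStatsToDelete seen in the diff):

```java
// The caller already holds the latest TableMetadata in `metadata`,
// so its stats can be collected without any storage IO.
metadataToDelete.add(metadata.metadataFileLocation());
statsToDelete.addAll(metadata.statisticsFiles());
partitionStatsToDelete.addAll(metadata.partitionStatisticsFiles());
```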
        partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
      }

      // Process the latest metadata file
I think instead of having dedicated code to take care of the latest metadata file, can't we simply initialize metadataToDelete and the stats collections with these values, instead of having this code here? (The existence check and IO read probably aren't needed anyway; see my comment below.)
          continue;
        }

        TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
I wonder if we need the existence check on L120 at all. We could just call this read() and handle the exception if the file doesn't exist. This could reduce the number of operations made on the storage.
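Roughly like this (a sketch; the exact exception a FileIO implementation throws for a missing file is an assumption, NotFoundException being the common one):

```java
TableMetadata previousMetadata;
try {
  // read directly; a missing file surfaces as an exception instead of
  // requiring a separate exists() round trip to the storage
  previousMetadata = TableMetadataParser.read(io, previousFile.file());
} catch (NotFoundException e) {
  continue; // the metadata file is already gone, nothing to collect
}
```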
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

    LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

    // Collect all metadata files and extract historical statistics files
    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
We could be doing a lot of operations on the storage with this for loop in a sequential manner. We could use a Tasks.foreach() to do it in parallel.
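A sketch of that, assuming Iceberg's Tasks and ThreadPools utilities (the collections would need to be thread-safe):

```java
// Read the historical metadata files in parallel instead of sequentially.
Tasks.foreach(metadata.previousFiles())
    .executeWith(ThreadPools.getWorkerPool())
    .suppressFailureWhenFinished()
    .run(previousFile -> {
      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
      statsToDelete.addAll(previousMetadata.statisticsFiles());
      partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
    });
```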
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {

    LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));

    // Collect all metadata files and extract historical statistics files
I found a ReachableFileUtil.metadataFileLocations() function that has a recursive mode: it not only iterates the previousFiles() but also looks into those metadata files and their previousFiles() too. I wonder in what cases the recursive run would give different results compared to the non-recursive one, but I guess it was implemented for some reason :)
I think we should check whether this iteration is enough here, or whether we should also do a recursive one to not miss any metadata locations.
            table.io());
    table
        .updateStatistics()
        .setStatistics(oldStatisticsFile.snapshotId(), oldStatisticsFile)
This function is deprecated. The other one without the snapshotId param should be used. This goes for other occurrences below too.
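i.e. something like this sketch, where the snapshot id comes from the StatisticsFile itself:

```java
// Non-deprecated variant: no explicit snapshotId parameter.
table
    .updateStatistics()
    .setStatistics(oldStatisticsFile)
    .commit();
```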
@gaborkaszab I totally agree with what you mentioned. I have my own doubts about the approach as well, and it's indeed very expensive, so I was hoping we could get some more eyes and opinions on this issue. I have the exact same thoughts on the 3 approaches listed. So far, I'm leaning towards tracking unreferenced stats files in the latest metadata file, although it would be good to know how commonly we would actually be updating the stats file for the same snapshot and creating unreferenced stats files. I think we might need to align on the approach before making any additional changes.
Another issue I found with the current approach is that if
@lliangyu-lin But in general, since the fix for this issue is not straightforward, I recommend starting a [DISCUSS] thread on the dev@ mailing list. I found that questions there usually have more visibility than on PRs.
Description
Currently, Iceberg dropTableData() does not properly delete statistics files (.stats) that are replaced by newer statistics files. When updateStatistics() is called multiple times on the same snapshot, the older statistics files are removed from metadata.json but remain in storage, leading to orphaned files. As Issue #11876 points out, #9305 and #9409 remove only the latest Puffin file; old Puffin files still remain.
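A minimal sequence illustrating the orphaning (statsV1 and statsV2 are hypothetical StatisticsFile instances for the same snapshot):

```java
// Two statistics commits for the same snapshot: the second replaces the first
// in metadata.json, but statsV1's Puffin file stays on storage.
table.updateStatistics().setStatistics(statsV1).commit();
table.updateStatistics().setStatistics(statsV2).commit();
// dropTableData() previously deleted only statsV2's file,
// leaving statsV1's file orphaned.
```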
This PR attempts to address this issue by looking through all metadata files and ensuring that all statistics and partition statistics files referenced in historical metadata are deleted.
An alternative solution would be to explicitly track old (replaced) statistics files in metadata files, but that would require changing the metadata spec.
Testing
- ./gradlew spotlessApply -DallModules passed
- ./gradlew build -x test -x integrationTest passed
- ./gradlew :iceberg-core:test --tests "org.apache.iceberg.hadoop.TestCatalogUtilDropTable" passed
Closes #11876