sdks/python: enable recursive deletion for GCSFileSystem Paths #33611
base: master
Conversation
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`.
In this commit, we enable recursive deletion for GCS (Google Cloud Storage) paths, including directories and blobs. Changes include:
- Updated the `delete` method to support recursive deletion of GCS directories (prefixes).
- If the path points to a directory, all blobs under that prefix are deleted.
- Refactored the logic to handle both single-blob and directory deletion cases.
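For context, a minimal sketch of the approach this commit describes, written against the `google-cloud-storage` client; the function name `delete_recursive` is illustrative, not the actual gcsio code:

```python
from google.cloud import storage


def delete_recursive(client: storage.Client, bucket_name: str, blob_name: str):
  """Delete the blob at blob_name, or all blobs under it if it is a prefix."""
  bucket = client.bucket(bucket_name)
  # A single list request covers both cases: an exact blob match, and a
  # directory-style prefix with multiple blobs underneath it.
  for blob in bucket.list_blobs(prefix=blob_name):
    blob.delete()
```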
In this commit, we update the delete test to verify recursive deletion of directories (prefixes) in GCS. Changes include:
- Added a test for deleting a GCS directory (prefix) containing multiple files.
- Verified that files under a directory are deleted recursively when using the `delete` method.
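A hedged sketch of such a test, assuming a real or emulated test bucket and the `delete_recursive` sketch above:

```python
def test_recursive_delete(client, bucket_name):
  """All files under a common prefix should be gone after one delete call."""
  bucket = client.bucket(bucket_name)
  for name in ("dir/a.txt", "dir/b.txt", "dir/sub/c.txt"):
    bucket.blob(name).upload_from_string("payload")

  delete_recursive(client, bucket_name, "dir/")

  assert list(bucket.list_blobs(prefix="dir/")) == []
```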
Force-pushed from 1ea7e3b to 01a4b65.
Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @tvalentyn for label python.

The PR bot will only process comments in the main thread (not review comments).
Hi @liferoad, it seems the test cases pass now. I would love to receive any feedback on this PR! 🙏
Ack. Thanks for contributing to Beam. I will review the code today.
Hi @shunping, I understand you were busy last week. Any updates on this PR review? 🙏
```python
# Check if the blob is a directory (prefix) by listing objects
# under that prefix.
blobs = list(bucket.list_blobs(prefix=blob_name))
```
I am afraid this line could impact the performance of existing pipelines. In particular, we now make an extra HTTP request to GCS to list the blobs in a bucket no matter what. If an existing pipeline is using `gcsio.delete()` to delete a directory with a large number of files, then the change will double the HTTP requests to GCS, which may lead to a request-quota-exceeded error.

I think a safer approach is to add a new API in gcsio specifically for this purpose, then change `delete()` in `gcsfilesystem.py` to use this API. Note that the original issue #27605 concerns the behavior of `delete()` in `gcsfilesystem.py`. We can then recommend that users use this new API when deleting a directory.
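For illustration only, one possible shape of that split; the method name `delete_tree` is hypothetical, and this assumes gcsio's `parse_gcs_path` helper and a `google-cloud-storage` client on `self.client`:

```python
# gcsio.py -- hypothetical new method; the name delete_tree is illustrative.
def delete_tree(self, path):
  """Delete every object under the given GCS prefix."""
  bucket_name, prefix = parse_gcs_path(path)  # assumes gcsio's path parser
  bucket = self.client.bucket(bucket_name)
  for blob in bucket.list_blobs(prefix=prefix):
    blob.delete()
```

`gcsfilesystem.delete()` could then call the new API only for directory-style paths, leaving single-file deletes on the existing single-request code path.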
> If an existing pipeline is using `gcsio.delete()` to delete a directory with a large number of files, then the change will double the HTTP requests to GCS, which may lead to a request-quota-exceeded error.
Makes sense 👍
> Note that the original issue #27605 concerns the behavior of `delete()` in `gcsfilesystem.py`.
Oops, missed that. 🙏
> I think a safer approach is to add a new API in gcsio specifically for this purpose, then change `delete()` in `gcsfilesystem.py` to use this API. We can then recommend that users use this new API when deleting a directory.
How about adding an optional parameter, such as `recursive`, with a default value of `False`, to ensure backward compatibility and avoid any performance impact on existing pipelines using the `delete` function in `gcsfilesystem.py`?
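Sketched concretely (the signature and helpers here are illustrative, not the current gcsio code), the proposal would look something like:

```python
def delete(self, path, recursive=False):
  """Delete one object, or everything under a prefix when recursive=True."""
  bucket_name, blob_name = parse_gcs_path(path)  # assumes gcsio's path parser
  bucket = self.client.bucket(bucket_name)
  if recursive:
    # Opt-in: one extra list request, then delete all blobs under the prefix.
    for blob in bucket.list_blobs(prefix=blob_name):
      blob.delete()
  else:
    # Default path unchanged: a single delete request, no extra listing.
    bucket.blob(blob_name).delete()
```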
> How about adding an optional parameter, such as `recursive`, with a default value of `False`, to ensure backward compatibility and avoid any performance impact on existing pipelines using the `delete` function in `gcsfilesystem.py`?
@shunping What do you think about this comment?
Adding the `recursive` parameter with a default of `False` sounds good to me, as it extends the functionality of `gcsio.delete()` without breaking compatibility.

To fix #27605, however, you will also need to make changes to `gcsfilesystem.py` to leverage the new functionality. For example, you can do something similar to https://github.com/apache/beam/pull/29477/files in `gcsfilesystem.delete()`:
```python
for path in paths:
  if path.endswith('/'):
    # This is a directory. Remove all of its contents, including
    # objects and subdirectories.
    self._gcsIO().delete(path, recursive=True)
    continue
  else:
    ...
```
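With this shape, the extra list request happens only for directory-style paths ending in `/`, so plain file deletes keep their current single-request cost.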
Sorry for the late reply, @mohamedawnallah. I left a comment about the potential performance impact of the new change.
Reminder, please take a look at this PR: @tvalentyn @shunping
Please make the changes, and I will be happy to review them again when ready.
Description
In this PR, we enable recursive deletion for Google Cloud Storage (GCS) paths, including directories and blobs. The `delete` method is updated to remove all blobs under a directory (prefix) when deleting GCS directories. Additionally, we update the delete test to verify the recursive deletion of directories containing multiple files.

Changes include:
- Updated the `delete` method to support recursive directory deletion.

Fixes #27605
Current Behavior
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- Update `CHANGES.md` with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.