Remove archive after it is extracted to save disk space #1351
Conversation
@prabhant This is what you are looking for, right?
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@           Coverage Diff            @@
##           develop    #1351   +/-  ##
==========================================
- Coverage    84.28%   84.20%   -0.09%
==========================================
  Files           38       38
  Lines         5288     5298      +10
==========================================
+ Hits          4457     4461       +4
- Misses         831      837       +6

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Looks good to me
It just sprang to my mind that removing the file conflicts with the caching mechanism, so the archive will now likely always be re-downloaded... What would be a good way to resolve that? Leaving some kind of empty marker file with the same name could be a quick hack. @LennartPurucker @eddiebergman
Interesting, I came across this exact problem recently: I don't know the contents of the tarfiles directly and needed to check whether their contents were already present. The problem is that I can't map the contents of an archive back to the archive itself, and so can't decide whether a new download should be triggered. I don't have a good solution, but please keep me posted if you come up with one!
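One way to approximate the check described above is to compare the archive's member list against what is on disk. This is only a hedged sketch with a hypothetical helper name, not anything from openml-python, and it compares names only, not contents, so it cannot detect stale or partially written files. It also still requires the archive itself to be present locally, which is exactly what this PR deletes:

```python
import tarfile
from pathlib import Path


def members_already_extracted(archive_path, target_dir) -> bool:
    """Return True if every member of the archive already exists under target_dir.

    Name-based check only: it does not verify file contents or sizes.
    """
    target = Path(target_dir)
    with tarfile.open(archive_path) as tar:
        return all((target / member.name).exists() for member in tar.getmembers())
```

Once the archive is removed after extraction, even this check is impossible, which is why the discussion below turns to marker files instead.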
Yea, I agree. A marker file indicating that the archive has already been extracted would be needed, plus an option to still force the download (a force bool or cleaning up the cache).
Ideally, we would have a uuid for the content of the zip file on the server that, if it changes, would prompt us to re-download. Then, we would just name the marker file using this uuid and check for a match before downloading and extracting again.
In this case we can apparently use the metadata, which contains a hash (at least, as far as I can tell, that's what it is). For a force refresh, we already support that at the get_dataset level, so I don't think we need to do anything special here.
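The marker-file idea discussed above could be sketched roughly as follows. This is a hypothetical illustration, not openml-python's actual implementation: the function names are invented, and where the content hash comes from (e.g. the server metadata mentioned above) is assumed.

```python
from pathlib import Path


def marker_path(extract_dir: Path, content_hash: str) -> Path:
    # Hypothetical naming scheme: an empty marker file named after the
    # content hash of the archive, written next to the extracted files.
    return Path(extract_dir) / f".extracted-{content_hash}"


def needs_download(extract_dir: Path, content_hash: str, force: bool = False) -> bool:
    # Re-download if forced, or if no marker for this hash exists.
    # A changed hash on the server yields a different marker name,
    # so stale extractions are re-downloaded automatically.
    return force or not marker_path(extract_dir, content_hash).exists()


def mark_extracted(extract_dir: Path, content_hash: str) -> None:
    # Called after successful extraction; the archive itself can then
    # be deleted to save disk space.
    marker_path(extract_dir, content_hash).touch()
```

A `force` refresh simply ignores the marker, which matches the existing get_dataset-level refresh mentioned above.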
Very nice, LGTM, thanks!
Closes #1348
I don't really know that we need to make this configurable at this point. Let's not add more options for now, and see if we get requests.