Remove archive after it is extracted to save disk space #1351

PGijsbers · 2024-09-13T09:32:45Z

I don't really know that we need to make this configurable at this point. Let's not add more options for now, and see if we get requests.

PGijsbers · 2024-09-13T09:33:03Z

@prabhant This is what you are looking for, right?

codecov-commenter · 2024-09-13T09:46:25Z

Codecov Report

Attention: Patch coverage is 16.66667% with 10 lines in your changes missing coverage. Please review.

Project coverage is 84.20%. Comparing base (7764ddb) to head (731b3e1).
Report is 1 commit behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| openml/_api_calls.py | 16.66% | 10 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1351      +/-   ##
===========================================
- Coverage    84.28%   84.20%   -0.09%     
===========================================
  Files           38       38              
  Lines         5288     5298      +10     
===========================================
+ Hits          4457     4461       +4     
- Misses         831      837       +6

| Flag | Coverage Δ | |
|---|---|---|
| | `84.20% <16.66%> (-0.09%)` | ⬇️ |


prabhant · 2024-09-27T12:46:05Z

Looks good to me

PGijsbers · 2024-09-27T12:58:26Z

It just sprang to my mind that removing the file does conflict with the caching mechanism, so it will likely always download the archive now... What would be a good way to resolve that? Leaving some kind of empty marker with the same name could be a quick hack. @LennartPurucker @eddiebergman
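To make the marker idea concrete, here is a minimal sketch of the quick hack, not the code in this PR. The function names and the `.extracted` suffix are hypothetical, and it assumes a zip archive that is extracted next to itself.

```python
import zipfile
from pathlib import Path

MARKER_SUFFIX = ".extracted"  # hypothetical naming convention, not from openml-python


def marker_for(archive: Path) -> Path:
    """Path of the empty marker that stands in for a removed archive."""
    return archive.with_name(archive.name + MARKER_SUFFIX)


def extract_and_remove(archive: Path) -> Path:
    """Extract `archive` into its own directory, delete it, and leave a marker."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(archive.parent)
    archive.unlink()  # free the disk space taken by the archive
    marker = marker_for(archive)
    marker.touch()  # zero-byte file recording "downloaded and extracted"
    return marker


def needs_download(archive: Path) -> bool:
    """Download only if neither the archive nor its marker is present."""
    return not (archive.exists() or marker_for(archive).exists())
```

With this in place, a cache check that used to test for the archive itself would call `needs_download` instead, so the removed archive no longer triggers a new download.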

eddiebergman · 2024-09-27T20:23:09Z

Interesting, I came across this exact problem recently: I didn't know the contents of the tarfiles up front and needed to check whether their contents were already present. The problem is that I can't map the extracted contents back to the archive itself, so I can't tell whether a new download should be triggered.

I didn't implement a solution other than force: bool, but your idea about a marker sounds promising. It would assume that any new archives that might be downloaded are uniquely named, so that you can map them one-to-one to a marker file.

I don't have a good solution but please keep me posted if you come up with one!

LennartPurucker

Yea, I agree. We would need a marker file recording that the archive has already been extracted, plus an option to still force the download (a force bool or cleaning up the cache).

Ideally, we would have a uuid for the content of the zip file on the server that, if it changes, would prompt us to re-download. Then, we would just name the marker file using this uuid and check for a match before downloading and extracting again.
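As a rough sketch of that check, assuming the server exposes some content identifier (a uuid or hash): the function name, the `.marker` suffix, and the `download_and_extract` callback below are all hypothetical and do not describe the openml-python API.

```python
from pathlib import Path
from typing import Callable


def ensure_extracted(
    cache_dir: Path,
    content_id: str,
    download_and_extract: Callable[[Path], None],
    force: bool = False,
) -> bool:
    """Return True if a download was triggered, False if the cache was reused."""
    marker = cache_dir / f"{content_id}.marker"
    if marker.exists() and not force:
        return False  # marker matches the server-side id: nothing to do

    cache_dir.mkdir(parents=True, exist_ok=True)
    # Markers left by older content versions no longer describe the cache.
    for stale in cache_dir.glob("*.marker"):
        stale.unlink()

    download_and_extract(cache_dir)  # caller-supplied; assumed to remove the archive
    marker.touch()
    return True
```

Because the marker name changes whenever the content identifier changes, a stale cache is refreshed automatically, while `force=True` covers the case where the user wants a fresh copy regardless.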

PGijsbers · 2024-09-28T16:15:32Z

In this case we can apparently use the metadata, which contains a hash (at least, as far as I can tell, that's what it is). For a force refresh, we already support that at the get_dataset level, so I don't think we need to do anything special here.

LennartPurucker

Very nice, LGTM, thanks!

Remove archive after it is extracted to save disk space

f490d66

PGijsbers marked this pull request as ready for review September 27, 2024 12:56

PGijsbers requested a review from LennartPurucker September 27, 2024 12:56

LennartPurucker requested changes Sep 28, 2024


PGijsbers added 4 commits September 28, 2024 17:57

Leave a marker after removing archive to avoid redownload

a7defec

Automatic refresh if expected marker is absent

6be724e

Be consistent about syntax use for path construction

8f78f0c

Merge branch 'develop' into add/cleanup

731b3e1

PGijsbers requested a review from LennartPurucker September 28, 2024 16:14

LennartPurucker approved these changes Sep 29, 2024


PGijsbers merged commit a3e57bb into develop Sep 29, 2024
7 of 14 checks passed

PGijsbers deleted the add/cleanup branch September 29, 2024 12:44
