feat: Overwrite partitions mode #3687

Merged
merged 3 commits into from
Jan 17, 2025

colin-ho
Contributor

Closes #1768

Overwrite-partitions mode overwrites only the files in partition directories that the write operation wrote into. For example, partition "A" is overwritten if and only if the write produced files for partition "A".

This PR also refactors the test code a bit.
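
To make the semantics above concrete, here is a minimal in-memory sketch of how the three write modes named in the linked issue differ. It is not Daft's implementation: `apply_write`, the `Layout` type, the mode strings, and the file names are all hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical in-memory model of a dataset: partition name -> file names.
Layout = Dict[str, List[str]]


def apply_write(existing: Layout, written: Layout, mode: str) -> Layout:
    """Return the layout after a write under the given (illustrative) mode."""
    if mode == "append":
        # Keep every existing file and add the newly written ones.
        result = {part: list(files) for part, files in existing.items()}
        for part, files in written.items():
            result.setdefault(part, []).extend(files)
        return result
    if mode == "overwrite":
        # Discard all existing files, in every partition.
        return {part: list(files) for part, files in written.items()}
    if mode == "overwrite-partitions":
        # Replace only the partitions that this write produced files for.
        result = {part: list(files) for part, files in existing.items()}
        for part, files in written.items():
            result[part] = list(files)
        return result
    raise ValueError(f"unknown write mode: {mode!r}")


existing = {"A": ["a-old.parquet"], "B": ["b-old.parquet"]}
written = {"A": ["a-new.parquet"]}

after_partitions = apply_write(existing, written, "overwrite-partitions")
after_overwrite = apply_write(existing, written, "overwrite")
after_append = apply_write(existing, written, "append")
```

In this sketch, partition "B" survives an overwrite-partitions write untouched because nothing was written into it, whereas a plain overwrite drops it.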

@github-actions github-actions bot added the feat label Jan 15, 2025

codspeed-hq bot commented Jan 15, 2025

CodSpeed Performance Report

Merging #3687 will improve performances by 42.87%

Comparing colin/overwrite-partitions (65f5da5) with main (6b302af)

Summary

⚡ 1 improvements
✅ 26 untouched benchmarks

Benchmarks breakdown

Benchmark                                  main      colin/overwrite-partitions  Change
test_iter_rows_first_row[100 Small Files]  193.5 ms  135.4 ms                    +42.87%


codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.

Project coverage is 77.94%. Comparing base (5702720) to head (65f5da5).
Report is 13 commits behind head on main.

Files with missing lines      Patch %  Lines
daft/dataframe/dataframe.py   71.42%   4 Missing ⚠️
daft/filesystem.py            86.66%   2 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3687      +/-   ##
==========================================
+ Coverage   77.79%   77.94%   +0.14%     
==========================================
  Files         729      725       -4     
  Lines       90408    90967     +559     
==========================================
+ Hits        70335    70904     +569     
+ Misses      20073    20063      -10     
Files with missing lines      Coverage Δ
daft/filesystem.py            70.20% <86.66%> (+0.36%) ⬆️
daft/dataframe/dataframe.py   85.43% <71.42%> (-0.09%) ⬇️

... and 84 files with indirect coverage changes

daft/filesystem.py Outdated
Comment on lines +378 to +390
all_file_paths = []
if overwrite_partitions:
    # Get all files in ONLY the directories that were written to.

    written_dirs = set(str(pathlib.Path(path).parent) for path in written_file_paths.to_pylist())
    for dir in written_dirs:
        file_selector = pafs.FileSelector(dir, recursive=True)
        try:
            all_file_paths.extend(
                [info.path for info in fs.get_file_info(file_selector) if info.type == pafs.FileType.File]
            )
        except FileNotFoundError:
            continue
Contributor Author

@desmondcheongzx new implementation ready for review, where we only look for files to delete IF they are in the partition directories.
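
The idea of looking only inside the written partition directories can be sketched with the standard library alone, using `pathlib` against a local temporary directory rather than `pyarrow.fs`. The helper name `stale_files_in_written_dirs` is hypothetical, and the step that excludes the just-written files from the candidates is an assumption about what happens to the collected list afterwards, not something shown in the snippet under review.

```python
import pathlib
import tempfile
from typing import Iterable, List


def stale_files_in_written_dirs(written_file_paths: Iterable[str]) -> List[str]:
    """List files that live in a written-to directory but were not just written."""
    written = {str(pathlib.Path(p)) for p in written_file_paths}
    # Only the parent directories of newly written files are ever inspected.
    written_dirs = {str(pathlib.Path(p).parent) for p in written}

    stale: List[str] = []
    for dir_name in sorted(written_dirs):
        directory = pathlib.Path(dir_name)
        if not directory.is_dir():
            # Mirrors the "skip missing directories" behaviour of the snippet.
            continue
        for candidate in directory.rglob("*"):
            # Assumption: files from the current write are kept, the rest are stale.
            if candidate.is_file() and str(candidate) not in written:
                stale.append(str(candidate))
    return sorted(stale)


with tempfile.TemporaryDirectory() as root:
    root_path = pathlib.Path(root)
    # Two partitions with old data, plus one new file written into partition A.
    for rel in ("part=A/old.parquet", "part=A/new.parquet", "part=B/old.parquet"):
        file_path = root_path / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text("")

    written_paths = [str(root_path / "part=A" / "new.parquet")]
    stale = stale_files_in_written_dirs(written_paths)
    stale_names = [pathlib.Path(p).relative_to(root_path).as_posix() for p in stale]
```

Here only `part=A/old.parquet` is a deletion candidate; `part=B` is never listed because no file was written into it.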

Contributor

@desmondcheongzx desmondcheongzx left a comment

Looks good, thanks for making it clearer!

@colin-ho colin-ho merged commit 412cef4 into main Jan 17, 2025
43 checks passed
@colin-ho colin-ho deleted the colin/overwrite-partitions branch January 17, 2025 20:18
Successfully merging this pull request may close these issues.

[FEAT] Allow for selection of append/overwrite/overwrite_partitions options when writing data
2 participants