Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add basic partition pruning support #713

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Feb 21, 2025

What changes are proposed in this pull request?

Add basic support for partition pruning by combining two pieces of existing infra:

  1. The log replay row visitor already needs to parse partition values and already filters out unwanted rows
  2. The default predicate evaluator works directly with scalars

Result: partition pruning gets applied during log replay, just before deduplication so we don't have to remember pruned files.

TODO: The implementation currently has a flaw, in case the history contains a table-replace that affected partition columns. For example, changing a value column into a non-nullable partition column, or an incompatible type change to a partition column. In such cases, the remove actions generated by the table-replace operation (for old files) would have the wrong type or even be entirely absent. While the code can handle an absent partition value, an incompatibly typed value would cause a parsing error that fails the whole query. Somehow, we need to suppress the error until we prove the action survived log replay. Related: #712

NOTE: While this is a convenient way to achieve partition pruning in the immediate term, Delta checkpoints can provide strongly-typed stats_parsed and partitionValues_parsed columns which would have a completely different access.

  • For stats vs. stats_parsed, the likely solution is simple enough because we already json-parse stats into a strongly-typed nested struct in order to evaluate the data skipping predicate over its record batch. We just avoid the parsing overhead if stats_parsed is already available.
  • The partitionValues field poses a bigger challenge, because it's a string-string map, not a JSON literal. In order to turn it into a strongly-typed nested struct, we would need a SQL expression that can extract the string values and try-cast them to the desired types. That's ugly enough we might prefer to keep completely different code paths for parsed vs. string partition values, but then there's a risk that partition pruning behavior changes depending on which path got invoked.

How was this change tested?

New unit tests, and adjusted one unit test that assumed no partition pruning.

@scovich scovich added the merge hold Don't allow the PR to merge label Feb 21, 2025
@scovich scovich requested review from nicklan and hntd187 February 21, 2025 22:53
Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 86.36364% with 9 lines in your changes missing coverage. Please review.

Project coverage is 84.38%. Comparing base (baa3fc3) to head (c069278).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/scan/log_replay.rs 86.15% 6 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #713      +/-   ##
==========================================
+ Coverage   84.35%   84.38%   +0.03%     
==========================================
  Files          75       75              
  Lines       17657    17702      +45     
  Branches    17657    17702      +45     
==========================================
+ Hits        14894    14938      +44     
  Misses       2053     2053              
- Partials      710      711       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump merge hold Don't allow the PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant