-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files #10711
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @xinlifoobar . I had some suggestions -- let me know what you think
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
Update the benchmark to reflect the results of newer commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xinlifoobar -- I'll check this out later today
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
MaxInt32StatsIterator::new(iterator) | ||
.map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)), | ||
))), | ||
DataType::Timestamp(_, _) => Ok(Arc::new(Int64Array::from_iter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected this to be a Timestamp array rather than an Int64 array 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is consistent with the existing code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created another PR to fix the timestamp statistic types. Let me know if you want it to be part of this PR or a following PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome -- thank you so much @xinlifoobar -- good team effort ✋
I took the liberty of also porting over the Decimal statistics extraction and removed the old get_statistic
code and pushed that to your branch
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
MaxInt32StatsIterator::new(iterator) | ||
.map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)), | ||
))), | ||
DataType::Timestamp(_, _) => Ok(Arc::new(Int64Array::from_iter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is consistent with the existing code
🤔 I seem to have broken a test -- I will review |
🤔 I think what is happening is that statistics for dictionary types are not being handled correctly. Investigting |
I messed this up, clearning review until we sort out regression
Thanks @alamb for the extensive help. I have updated the latest benchmark from my local machine in the descriptions. |
I'd like the idea to use the partial specialization for Decimal, I am actually thinking of passing an array to the |
I think using a Let's ensure it doesn't need |
the other obvious thing to do in this PR might be to use a macro to avoid the copy/paste between min_statistics and max_statistics... So I guess the question is "shall we keep working on this PR or shall we plan to work on it as a follow on PR" 🤔 |
LGTM. It was clear to me what it is missing. Thanks! This PR is mixed with too many items I think... I will move out some of the code, e.g., for timestamps, later in separated PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have now filled out support for the other data types, so perhaps we can revisit the code in this PR again. What do you think @xinlifoobar ?
Converting to draft as we figure out next steps |
Hey @alamb, I just republished this PR with some major changes including:
I also play around the |
Seems there is a CI issue, not related to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks grat to me -- thank you @xinlifoobar
Thank you @xinlifoobar and @alamb 🚀 |
… extracting statistics from parquet files (apache#10711) * Fix incorrect statistics read for unsigned integers columns in parquet * Staging the change for faster stat * Improve performance of extracting statistics from parquet files * Revert "Improve performance of extracting statistics from parquet files" This reverts commit 2faec57. * Revert "Staging the change for faster stat" This reverts commit 095ac39. * Refine using the iterator idea * Add the rest types * Consolidate Decimal statistics extraction * clippy * Simplify * Fix dictionary type * Fix incorrect statistics read for timestamp columns in parquet * Add exhaustive match * Update latest datatypes * fix bad comment * Remove duplications using paste * Fix comment * Update Cargo.lock * fix docs --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #10626
Rationale for this change
What changes are included in this PR?
Replace the
get_statatistics
macro byget_statistics_iter
to pass the whole iterator as argument and avoid the usage of intermediateScalarValue
. Some improvements comparing to currentmain
branch.Are these changes tested?
The changes are tested against current unit tests in the
statistics.rs
andarrow_statistics.rs
with only minor changes to the testcases themselves.Are there any user-facing changes?