-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: API for collecting statistics/index for metadata of a parquet file + tests #10537
Conversation
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs
Outdated
Show resolved
Hide resolved
} | ||
|
||
//////////////// WRITE STATISTICS /////////////////////// | ||
let file_meta = writer.close().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my surprise, the write and read statistics even though have the same content are stored in different structures. Write is parquet::format::statistics
and read is parquet::file::statistics::Statistics
. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One is the native parquet encoding, one is the Rust version of it. In general you should avoid using the format::statistics
directly
pub fn parquet_stats_to_arrow<'a>( | ||
arrow_datatype: &DataType, | ||
statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>, | ||
) -> Result<ArrowStatistics> { | ||
todo!() // MY TODO next | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To emphasise the point I made when this API was originally proposed, you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetadata to get the https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order in order to be able to even interpret what the statistics mean for a given column.
Additionally you need to actually have the parquet schema as the arrow datatype may not match what the parquet data is encoded as. The parquet schema is authoritative when reading parquet data, the arrow datatype is purely what the data should be coerced to once read from parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of "column order" I think we initially should do what DataFusion currently does with ColumnOrder (which is ignore it) and file a ticket to handle it longer term
Including the parquet schema is a good idea. I think this will become more obvious as we begin writing these tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is very easy to get FileMetaData
from the parquet reader. I agree the sort order is not needed (yet) but I will see what we needs we we go and add them in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10586
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @NGA-TRAN
I also filed #10546 and will get that example in place so we can show how this API will be used (in addition to hooking it into the existing ListingTable and ParquetExec) which I think will help design the API
As @tustvold mentions, the actual signature of parquet_stats_to_arrow
will likely need to change, but I think that is ok
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs
Outdated
Show resolved
Hide resolved
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs
Outdated
Show resolved
Hide resolved
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs
Outdated
Show resolved
Hide resolved
datafusion/core/src/datasource/physical_plan/parquet/arrow_statistics.rs
Outdated
Show resolved
Hide resolved
Refine statistics extraction API and tests
@alamb : The PR is ready for review. Some notes:
|
|
||
Test { | ||
reader, | ||
// mins are [-5, -4, 0, 5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be related to #9779 (something related to how parquet handles signed integers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good hint for me to understand the issue more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #10585
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @NGA-TRAN -- I think this PR is a significant step forward and shows well how some of the APIs can be used and the limitations
Here are my suggested next steps:
- I will file a ticket for incorrect statistics from Int8 / Int16
- I will file a ticket about ignoring ColumnOrder as mentioned by @tustvold in feat: API for collecting statistics/index for metadata of a parquet file + tests #10537 (comment)
- I will file a ticket about potentially incorrect date statistics being read from parquet files
- I'll try and make a PR that switches the existing code over to using this new API (to verify it basically fits)
Test { | ||
reader, | ||
// mins are [18262, 18565,] | ||
expected_min: Arc::new(Int32Array::from(vec![18262, 18565])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would actually expect the returned type to be Date32Array
as the underlying arrow type is Date32 -- I don't think we need to make this change as part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let reader = parquet_file_many_columns(Scenario::Dates, row_per_group).await; | ||
Test { | ||
reader, | ||
expected_min: Arc::new(Int64Array::from(vec![18262, 18565])), // panic here because the actual data is Int32Array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, I would expect this to be Date64Array (not Int64Array).
expected_null_counts: UInt64Array::from(vec![2, 2]), | ||
expected_row_counts: UInt64Array::from(vec![13, 7]), | ||
} | ||
.run_col_not_found("not_a_column"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Co-authored-by: Andrew Lamb <[email protected]>
Test { | ||
reader, | ||
// mins are [18262, 18565,] | ||
expected_min: Arc::new(Int32Array::from(vec![18262, 18565])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have filed the following tickets
I think this PR is now ready to go. I plan to merge it in when the CI passes |
…ile + tests (apache#10537) * test: some tests to write data to a parquet file and read its metadata * feat: API to convert parquet stats to arrow stats * Refine statistics extraction API and tests * Implement null counts * port test * test: add more tests for the arrow statistics * chore: fix format and test output * chore: rename test helpers * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * Apply suggestions from code review * Apply suggestions from code review --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
First PR of ##10453. This PR does:
Rationale for this change
The API will help us to prune more files and make it more effectively
What changes are included in this PR?
RequestedStatistics
RequestedStatistics
formAre these changes tested?
Yes
Are there any user-facing changes?
New API