
Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files #10711

Merged · 25 commits into apache:main · Jun 5, 2024

Conversation

@xinlifoobar commented May 29, 2024

Which issue does this PR close?

Closes #10626

Rationale for this change

What changes are included in this PR?

Replace the get_statistics macro with get_statistics_iter, which takes the whole iterator as an argument and avoids creating intermediate ScalarValues. The benchmarks below show improvements compared to the current main branch.
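The idea behind the change can be sketched in plain Rust. `ScalarValue` below is a simplified stand-in for the real DataFusion type, and the functions are hypothetical illustrations, not the actual PR code:

```rust
// Simplified stand-in for DataFusion's ScalarValue (illustration only).
#[derive(Debug, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
}

// Old shape: wrap every per-row-group statistic in an intermediate enum
// value, then unwrap it again while building the output array.
fn via_scalar(stats: &[Option<i64>]) -> Vec<Option<i64>> {
    let scalars: Vec<ScalarValue> = stats.iter().map(|v| ScalarValue::Int64(*v)).collect();
    scalars
        .into_iter()
        .map(|s| match s {
            ScalarValue::Int64(v) => v,
        })
        .collect()
}

// New shape: feed the statistics iterator straight into the collector,
// skipping the intermediate allocation and enum wrapping entirely.
fn via_iterator(stats: &[Option<i64>]) -> Vec<Option<i64>> {
    stats.iter().copied().collect()
}

fn main() {
    let stats = [Some(1), None, Some(3)];
    assert_eq!(via_scalar(&stats), via_iterator(&stats));
    println!("{:?}", via_iterator(&stats));
}
```

In the real code the collector is an Arrow array constructor such as `Int64Array::from_iter`, so the per-value enum round trip disappears in the same way.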

$ cargo bench --bench parquet_statistic -- --baseline main
Extract statistics for UInt64/extract_statistics/UInt64
                        time:   [742.16 ns 742.70 ns 743.31 ns]
                        change: [-17.418% -17.169% -16.905%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  6 (6.00%) high severe

Extract statistics for F64/extract_statistics/F64
                        time:   [749.50 ns 750.13 ns 750.80 ns]
                        change: [-42.206% -41.719% -41.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  5 (5.00%) high severe

Extract statistics for String/extract_statistics/String
                        time:   [1.0791 µs 1.0807 µs 1.0824 µs]
                        change: [-35.391% -34.481% -33.804%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  6 (6.00%) low mild
  7 (7.00%) high mild
  7 (7.00%) high severe

Extract statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)
                        time:   [910.83 ns 912.00 ns 913.44 ns]
                        change: [-42.793% -41.238% -39.891%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe

Are these changes tested?

The changes are tested against the existing unit tests in statistics.rs and arrow_statistics.rs, with only minor changes to the test cases themselves.

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate) label May 29, 2024
@xinlifoobar changed the title from "Try to Improve performance of extracting statistics from parquet files" to "Avoid the Usage of Intermediate ScalarValue to Improve performance of extracting statistics from parquet files" May 29, 2024
@xinlifoobar changed the title to "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files" May 29, 2024
@alamb left a comment:

Thank you so much @xinlifoobar . I had some suggestions -- let me know what you think

@xinlifoobar

Update the benchmark to reflect the results of newer commit.

@alamb left a comment:

Thanks @xinlifoobar -- I'll check this out later today

```rust
    MaxInt32StatsIterator::new(iterator)
        .map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)),
))),
DataType::Timestamp(_, _) => Ok(Arc::new(Int64Array::from_iter(
```
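For context, the first mapped line above widens a Date32 statistic (stored as days since the Unix epoch) into an i64 millisecond value. A minimal standalone sketch of that conversion, with a hypothetical function name:

```rust
// Date32 stores days since 1970-01-01; widening to i64 before multiplying
// avoids overflowing i32 (mirrors the diff's `i64::from(*x) * 24 * 60 * 60 * 1000`).
fn date32_days_to_millis(days: i32) -> i64 {
    i64::from(days) * 24 * 60 * 60 * 1000
}

fn main() {
    assert_eq!(date32_days_to_millis(0), 0);
    assert_eq!(date32_days_to_millis(1), 86_400_000); // one day of milliseconds
    assert_eq!(date32_days_to_millis(-1), -86_400_000); // pre-epoch dates work too
    println!("{}", date32_days_to_millis(1));
}
```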
Review comment:

I would have expected this to be a Timestamp array rather than an Int64 array 🤔

Reply:

It is consistent with the existing code

Reply from @xinlifoobar (author):

I created another PR to fix the timestamp statistic types. Let me know if you want it to be part of this PR or a following PR.

https://github.com/xinlifoobar/datafusion/pull/1/files

alamb previously approved these changes May 30, 2024

@alamb left a comment:

This is awesome -- thank you so much @xinlifoobar -- good team effort ✋

I took the liberty of also porting over the Decimal statistics extraction and removed the old get_statistic code and pushed that to your branch


@alamb commented May 30, 2024

🤔 I seem to have broken a test -- I will review

@alamb commented May 30, 2024

🤔 I think what is happening is that statistics for dictionary types are not being handled correctly. Investigating

@alamb dismissed their stale review May 30, 2024 22:22

I messed this up; clearing review until we sort out the regression

@xinlifoobar commented May 31, 2024

Thanks @alamb for the extensive help. I have updated the description with the latest benchmark from my local machine.

@xinlifoobar commented May 31, 2024

> This is awesome -- thank you so much @xinlifoobar -- good team effort ✋
>
> I took the liberty of also porting over the Decimal statistics extraction and removed the old get_statistic code and pushed that to your branch

I like the idea of using partial specialization for Decimal. I am actually thinking of passing an array to the make_stats_iterator macro to make it more generic; on the other hand, we would lose some validation abilities.

@alamb commented May 31, 2024

> I like the idea of using partial specialization for Decimal. I am actually thinking of passing an array to the make_stats_iterator macro to make it more generic; on the other hand, we would lose some validation abilities.

I think using a &[&ParquetStatistics] (aka an array of references to statistics) would be good.

Let's ensure it doesn't need &[ParquetStatistics] (a slice of owned statistics), as I would very much like to be able to evaluate this code only for row groups that we haven't already filtered out (i.e. I want to be able to filter the statistics before passing them down here)
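The shape being requested can be sketched like this; `RowGroupStats`, `extract_mins`, and the predicate are hypothetical stand-ins for the real Parquet types, not the actual API:

```rust
// Hypothetical stand-in for parquet's per-row-group statistics.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Taking &[&RowGroupStats] (a slice of references) lets callers filter row
// groups first and pass only the survivors, without cloning or re-owning
// any statistics.
fn extract_mins(stats: &[&RowGroupStats]) -> Vec<i64> {
    stats.iter().map(|s| s.min).collect()
}

fn main() {
    let all = vec![
        RowGroupStats { min: 0, max: 5 },
        RowGroupStats { min: 10, max: 20 },
        RowGroupStats { min: 2, max: 30 },
    ];
    // Prune row groups that cannot satisfy `value > 8` before extraction.
    let kept: Vec<&RowGroupStats> = all.iter().filter(|s| s.max > 8).collect();
    assert_eq!(extract_mins(&kept), vec![10, 2]);
    println!("{:?}", extract_mins(&kept));
}
```

With a slice of owned values instead, the caller would have to clone or move statistics out of the already-decoded metadata just to build the filtered subset.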

@alamb commented May 31, 2024

The other obvious thing to do in this PR might be to use a macro to avoid the copy/paste between min_statistics and max_statistics...

So I guess the question is: shall we keep working on this PR, or plan to work on it as a follow-on PR? 🤔
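The kind of deduplication being suggested can be sketched with a plain macro_rules! macro. `Stats` and the field names here are simplified stand-ins, not the real parquet statistics API:

```rust
// Simplified stand-in for per-row-group statistics (illustration only).
struct Stats {
    min: i64,
    max: i64,
}

// One macro generates both extractors, so the iteration logic is written
// once and parameterized by which field it reads.
macro_rules! make_extractor {
    ($fn_name:ident, $field:ident) => {
        fn $fn_name(stats: &[Stats]) -> Vec<i64> {
            stats.iter().map(|s| s.$field).collect()
        }
    };
}

make_extractor!(min_statistics, min);
make_extractor!(max_statistics, max);

fn main() {
    let stats = vec![Stats { min: 1, max: 10 }, Stats { min: 3, max: 7 }];
    assert_eq!(min_statistics(&stats), vec![1, 3]);
    assert_eq!(max_statistics(&stats), vec![10, 7]);
    println!("min: {:?}, max: {:?}", min_statistics(&stats), max_statistics(&stats));
}
```

The paste! crate mentioned later in the thread goes one step further by also generating the function names from a single macro invocation.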

@xinlifoobar

> Thanks again @xinlifoobar -- I went through this PR and I don't think we can merge it yet as is, because it has a regression.
>
> The current code extracts the values as the underlying Parquet type (e.g. as an Int32Array) and then the predicate pruning code casts them as necessary.
>
> However, when I pushed one more commit to make the types that are not yet handled explicit (rather than relying on a `_`), I saw a bunch of important types that aren't yet covered (like Interval, for example).
>
> Thus I suggest:
>
>   1. Break out your change to support extracting timestamps with/without timezones as its own PR (it is quite good)
>   2. We then work on filling out tests for the other types (I'll file tickets)
>   3. Then we can come back to this PR (or maybe we want to work on it in parallel)
>
> I am sorry I didn't see this before, and I am sorry that we clearly don't have adequate test coverage

LGTM. It is clear to me what is missing. Thanks!

This PR mixes too many items, I think. I will move some of the code out, e.g. for timestamps, in a separate PR later.

@alamb left a comment:

I think we have now filled out support for the other data types, so perhaps we can revisit the code in this PR again. What do you think @xinlifoobar ?

@alamb marked this pull request as draft June 4, 2024 20:41
@alamb commented Jun 4, 2024

Converting to draft as we figure out next steps

@xinlifoobar marked this pull request as ready for review June 5, 2024 03:51
@xinlifoobar

> I think we have now filled out support for the other data types, so perhaps we can revisit the code in this PR again. What do you think @xinlifoobar ?

Hey @alamb, I just republished this PR with some major changes, including:

  • Added the types we missed previously.
  • Played around with the paste! macro, which I found useful here for removing duplicated code. (Feel free to revert the change though.)

I also tried combining make_stats_iterator and make_decimal_stats_iterator with the paste macro, but that made the code difficult to read. I recommend keeping them separate and clean. What do you think?

@xinlifoobar changed the title from "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files, support correct timestamp extraction" to "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files" Jun 5, 2024
@xinlifoobar

Seems there is a CI issue, not related to this PR.

@alamb left a comment:

Looks great to me -- thank you @xinlifoobar

@Dandandan merged commit 9845e6e into apache:main Jun 5, 2024
25 checks passed
@Dandandan

Thank you @xinlifoobar and @alamb 🚀

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
… extracting statistics from parquet files (apache#10711)

* Fix incorrect statistics read for unsigned integers columns in parquet

* Staging the change for faster stat

* Improve performance of extracting statistics from parquet files

* Revert "Improve performance of extracting statistics from parquet files"

This reverts commit 2faec57.

* Revert "Staging the change for faster stat"

This reverts commit 095ac39.

* Refine using the iterator idea

* Add the rest types

* Consolidate Decimal statistics extraction

* clippy

* Simplify

* Fix dictionary type

* Fix incorrect statistics read for timestamp columns in parquet

* Add exhaustive match

* Update latest datatypes

* fix bad comment

* Remove duplications using paste

* Fix comment

* Update Cargo.lock

* fix docs

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels: core (Core DataFusion crate)

Successfully merging this pull request may close these issues: Improve performance of extracting statistics from parquet files

3 participants