
Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files #10711

Merged · 25 commits into apache:main · Jun 5, 2024

Conversation

@xinlifoobar commented May 29, 2024

Which issue does this PR close?

Closes #10626

Rationale for this change

What changes are included in this PR?

Replace the get_statistics macro with get_statistics_iter, which takes the whole iterator as an argument and avoids creating intermediate ScalarValues. The benchmarks below show improvements compared to the current main branch.
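The idea behind the change can be sketched in plain Rust. `ScalarValue` below is a simplified stand-in for the real DataFusion type, and the functions are hypothetical illustrations, not the actual PR code:

```rust
// Simplified stand-in for DataFusion's ScalarValue (illustration only).
#[derive(Debug, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
}

// Old shape: wrap every per-row-group statistic in an intermediate enum
// value, then unwrap it again while building the output array.
fn via_scalar(stats: &[Option<i64>]) -> Vec<Option<i64>> {
    let scalars: Vec<ScalarValue> = stats.iter().map(|v| ScalarValue::Int64(*v)).collect();
    scalars
        .into_iter()
        .map(|s| match s {
            ScalarValue::Int64(v) => v,
        })
        .collect()
}

// New shape: feed the statistics iterator straight into the collector,
// skipping the intermediate allocation and enum wrapping entirely.
fn via_iterator(stats: &[Option<i64>]) -> Vec<Option<i64>> {
    stats.iter().copied().collect()
}

fn main() {
    let stats = [Some(1), None, Some(3)];
    assert_eq!(via_scalar(&stats), via_iterator(&stats));
    println!("{:?}", via_iterator(&stats));
}
```

In the real code the collector is an Arrow array constructor such as `Int64Array::from_iter`, so the per-value enum round trip disappears in the same way.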

$ cargo bench --bench parquet_statistic -- --baseline main
Extract statistics for UInt64/extract_statistics/UInt64
                        time:   [742.16 ns 742.70 ns 743.31 ns]
                        change: [-17.418% -17.169% -16.905%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  6 (6.00%) high severe

Extract statistics for F64/extract_statistics/F64
                        time:   [749.50 ns 750.13 ns 750.80 ns]
                        change: [-42.206% -41.719% -41.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  5 (5.00%) high severe

Extract statistics for String/extract_statistics/String
                        time:   [1.0791 µs 1.0807 µs 1.0824 µs]
                        change: [-35.391% -34.481% -33.804%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  6 (6.00%) low mild
  7 (7.00%) high mild
  7 (7.00%) high severe

Extract statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, String)
                        time:   [910.83 ns 912.00 ns 913.44 ns]
                        change: [-42.793% -41.238% -39.891%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe

Are these changes tested?

The changes are tested against the existing unit tests in statistics.rs and arrow_statistics.rs, with only minor changes to the test cases themselves.

Are there any user-facing changes?

@github-actions bot added the core (Core DataFusion crate) label May 29, 2024
@xinlifoobar changed the title from "Try to Improve performance of extracting statistics from parquet files" to "Avoid the Usage of Intermediate ScalarValue to Improve performance of extracting statistics from parquet files" May 29, 2024
@xinlifoobar changed the title to "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files" May 29, 2024
@alamb left a comment:

Thank you so much @xinlifoobar . I had some suggestions -- let me know what you think

@xinlifoobar

Update the benchmark to reflect the results of newer commit.

@alamb left a comment:

Thanks @xinlifoobar -- I'll check this out later today

```rust
    MaxInt32StatsIterator::new(iterator)
        .map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)),
))),
DataType::Timestamp(_, _) => Ok(Arc::new(Int64Array::from_iter(
```
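For context, the first mapped line above widens a Date32 statistic (stored as days since the Unix epoch) into an i64 millisecond value. A minimal standalone sketch of that conversion, with a hypothetical function name:

```rust
// Date32 stores days since 1970-01-01; widening to i64 before multiplying
// avoids overflowing i32 (mirrors the diff's `i64::from(*x) * 24 * 60 * 60 * 1000`).
fn date32_days_to_millis(days: i32) -> i64 {
    i64::from(days) * 24 * 60 * 60 * 1000
}

fn main() {
    assert_eq!(date32_days_to_millis(0), 0);
    assert_eq!(date32_days_to_millis(1), 86_400_000); // one day of milliseconds
    assert_eq!(date32_days_to_millis(-1), -86_400_000); // pre-epoch dates work too
    println!("{}", date32_days_to_millis(1));
}
```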
Review comment:

I would have expected this to be a Timestamp array rather than an Int64 array 🤔

Reply:

It is consistent with the existing code

Reply from @xinlifoobar (author):

I created another PR to fix the timestamp statistic types. Let me know if you want it to be part of this PR or a following PR.

https://github.com/xinlifoobar/datafusion/pull/1/files

alamb previously approved these changes May 30, 2024

@alamb left a comment:

This is awesome -- thank you so much @xinlifoobar -- good team effort ✋

I took the liberty of also porting over the Decimal statistics extraction and removed the old get_statistic code and pushed that to your branch


@alamb commented May 30, 2024

🤔 I seem to have broken a test -- I will review

@alamb commented May 30, 2024

🤔 I think what is happening is that statistics for dictionary types are not being handled correctly. Investigating

@alamb dismissed their stale review May 30, 2024 22:22

I messed this up; clearing review until we sort out the regression

@xinlifoobar commented May 31, 2024

Thanks @alamb for the extensive help. I have updated the description with the latest benchmark from my local machine.

@xinlifoobar commented May 31, 2024

> This is awesome -- thank you so much @xinlifoobar -- good team effort ✋
>
> I took the liberty of also porting over the Decimal statistics extraction and removed the old get_statistic code and pushed that to your branch

I like the idea of using partial specialization for Decimal. I am actually thinking of passing an array to the make_stats_iterator macro to make it more generic; on the other hand, we would lose some validation abilities.

@alamb commented May 31, 2024

> I like the idea of using partial specialization for Decimal. I am actually thinking of passing an array to the make_stats_iterator macro to make it more generic; on the other hand, we would lose some validation abilities.

I think using a &[&ParquetStatistics] (aka an array of references to statistics) would be good.

Let's ensure it doesn't need &[ParquetStatistics] (a slice of owned statistics), as I would very much like to be able to evaluate this code only for row groups that we haven't already filtered out (i.e. I want to be able to filter the statistics before passing them down here)
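The shape being requested can be sketched like this; `RowGroupStats`, `extract_mins`, and the predicate are hypothetical stand-ins for the real Parquet types, not the actual API:

```rust
// Hypothetical stand-in for parquet's per-row-group statistics.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Taking &[&RowGroupStats] (a slice of references) lets callers filter row
// groups first and pass only the survivors, without cloning or re-owning
// any statistics.
fn extract_mins(stats: &[&RowGroupStats]) -> Vec<i64> {
    stats.iter().map(|s| s.min).collect()
}

fn main() {
    let all = vec![
        RowGroupStats { min: 0, max: 5 },
        RowGroupStats { min: 10, max: 20 },
        RowGroupStats { min: 2, max: 30 },
    ];
    // Prune row groups that cannot satisfy `value > 8` before extraction.
    let kept: Vec<&RowGroupStats> = all.iter().filter(|s| s.max > 8).collect();
    assert_eq!(extract_mins(&kept), vec![10, 2]);
    println!("{:?}", extract_mins(&kept));
}
```

With a slice of owned values instead, the caller would have to clone or move statistics out of the already-decoded metadata just to build the filtered subset.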

@alamb commented May 31, 2024

The other obvious thing to do in this PR might be to use a macro to avoid the copy/paste between min_statistics and max_statistics...

So I guess the question is: shall we keep working on this PR, or plan to work on it as a follow-on PR? 🤔
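The kind of deduplication being suggested can be sketched with a plain macro_rules! macro. `Stats` and the field names here are simplified stand-ins, not the real parquet statistics API:

```rust
// Simplified stand-in for per-row-group statistics (illustration only).
struct Stats {
    min: i64,
    max: i64,
}

// One macro generates both extractors, so the iteration logic is written
// once and parameterized by which field it reads.
macro_rules! make_extractor {
    ($fn_name:ident, $field:ident) => {
        fn $fn_name(stats: &[Stats]) -> Vec<i64> {
            stats.iter().map(|s| s.$field).collect()
        }
    };
}

make_extractor!(min_statistics, min);
make_extractor!(max_statistics, max);

fn main() {
    let stats = vec![Stats { min: 1, max: 10 }, Stats { min: 3, max: 7 }];
    assert_eq!(min_statistics(&stats), vec![1, 3]);
    assert_eq!(max_statistics(&stats), vec![10, 7]);
    println!("min: {:?}, max: {:?}", min_statistics(&stats), max_statistics(&stats));
}
```

The paste! crate mentioned later in the thread goes one step further by also generating the function names from a single macro invocation.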

@xinlifoobar

> Thanks again @xinlifoobar -- I went through this PR and I don't think we can merge it yet as is, because it has a regression.
>
> The current code extracts the values as the underlying Parquet type (e.g. as an Int32Array) and then the predicate pruning code casts them as necessary.
>
> However, when I pushed one more commit to make the types that are not yet handled explicit (rather than relying on a `_`), I saw a bunch of important types that aren't yet covered (like Interval, for example).
>
> Thus I suggest:
>
>   1. Break out your change to support extracting timestamps with/without timezones as its own PR (it is quite good)
>   2. We then work on filling out tests for the other types (I'll file tickets)
>   3. Then we can come back to this PR (or maybe we want to work on it in parallel)
>
> I am sorry I didn't see this before, and I am sorry that we clearly don't have adequate test coverage

LGTM. It is clear to me what is missing. Thanks!

This PR mixes too many items, I think. I will move some of the code out, e.g. for timestamps, in a separate PR later.

@alamb left a comment:

I think we have now filled out support for the other data types, so perhaps we can revisit the code in this PR again. What do you think @xinlifoobar ?

@alamb marked this pull request as draft June 4, 2024 20:41
@alamb commented Jun 4, 2024

Converting to draft as we figure out next steps

@xinlifoobar marked this pull request as ready for review June 5, 2024 03:51
@xinlifoobar

> I think we have now filled out support for the other data types, so perhaps we can revisit the code in this PR again. What do you think @xinlifoobar ?

Hey @alamb, I just republished this PR with some major changes, including:

  • Added the types we missed previously.
  • Played around with the paste! macro, which I found useful here for removing duplicated code. (Feel free to revert the change though.)

I also tried combining make_stats_iterator and make_decimal_stats_iterator with the paste macro, but that made the code difficult to read. I recommend keeping them separate and clean. What do you think?

@xinlifoobar changed the title from "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files, support correct timestamp extraction" to "Avoid the usage of intermediate ScalarValue to improve performance of extracting statistics from parquet files" Jun 5, 2024
@xinlifoobar

Seems there is a CI issue, not related to this PR.

@alamb left a comment:

Looks great to me -- thank you @xinlifoobar

@Dandandan merged commit 9845e6e into apache:main Jun 5, 2024
25 checks passed
@Dandandan

Thank you @xinlifoobar and @alamb 🚀

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
… extracting statistics from parquet files (apache#10711)

* Fix incorrect statistics read for unsigned integers columns in parquet

* Staging the change for faster stat

* Improve performance of extracting statistics from parquet files

* Revert "Improve performance of extracting statistics from parquet files"

This reverts commit 2faec57.

* Revert "Staging the change for faster stat"

This reverts commit 095ac39.

* Refine using the iterator idea

* Add the rest types

* Consolidate Decimal statistics extraction

* clippy

* Simplify

* Fix dictionary type

* Fix incorrect statistics read for timestamp columns in parquet

* Add exhaustive match

* Update latest datatypes

* fix bad comment

* Remove duplications using paste

* Fix comment

* Update Cargo.lock

* fix docs

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels: core (Core DataFusion crate)

Successfully merging this pull request may close these issues: Improve performance of extracting statistics from parquet files

3 participants