[PERF] Improve hash table probe side decisions for Swordfish #3327

desmondcheongzx · 2024-11-18T21:57:18Z

This PR lifts statistics into optimized logical plans so that they're available for local execution plans. It then uses these newly available statistics to make better decisions on whether to build the probe table of a hash join on the left or right side.

Benchmark results

For TPC-H, this gives us some notable speedups with Q5, Q8, and Q19.

Q5: ~2.2x (in memory), ~1.5 (parquet)
Q8: ~5.9x (in memory), ~2.2x (parquet)
Q19: ~7x (in memory), ~4.5x (parquet)

Crucially, with this change, our native runner is now faster (or within some small deviation) than our previous python runner for all 22 TPC-H queries.

For more detailed results, we have:

Q5

Before

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-5]     28.7213 (1.0)      30.3664 (1.0)      29.4645 (1.0)      0.3849 (1.0)      29.4894 (1.0)      0.5529 (1.0)          10;0  33.9391 (1.0)          33           1
test_tpch[1-in-memory-native-5]     30.9980 (1.08)     34.1489 (1.12)     32.2151 (1.09)     0.7150 (1.86)     32.2903 (1.09)     1.1586 (2.10)         10;0  31.0413 (0.91)         31           1
test_tpch[1-parquet-python-5]       48.8010 (1.70)     51.8535 (1.71)     50.0985 (1.70)     0.8342 (2.17)     50.0400 (1.70)     1.4193 (2.57)          9;0  19.9607 (0.59)         20           1
test_tpch[1-parquet-native-5]       51.1122 (1.78)     54.0755 (1.78)     52.3799 (1.78)     0.8317 (2.16)     52.4268 (1.78)     1.2526 (2.27)          8;0  19.0913 (0.56)         20           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-5]     13.8393 (1.0)      16.2712 (1.0)      14.4352 (1.0)      0.5322 (2.13)     14.3212 (1.0)      0.5084 (1.71)          9;5  69.2750 (1.0)          62           1
test_tpch[1-in-memory-python-5]     28.3915 (2.05)     29.5304 (1.81)     28.8477 (2.00)     0.2501 (1.0)      28.8299 (2.01)     0.2971 (1.0)          10;1  34.6649 (0.50)         35           1
test_tpch[1-parquet-native-5]       34.3952 (2.49)     36.4230 (2.24)     35.4314 (2.45)     0.4543 (1.82)     35.3678 (2.47)     0.5042 (1.70)          8;1  28.2235 (0.41)         28           1
test_tpch[1-parquet-python-5]       55.7339 (4.03)     57.3051 (3.52)     56.4564 (3.91)     0.5101 (2.04)     56.2507 (3.93)     0.7570 (2.55)          4;0  17.7128 (0.26)         18           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q8

Before

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-8]     14.9532 (1.0)      17.2992 (1.0)      16.1906 (1.0)      0.5962 (1.22)     16.1370 (1.0)      0.8323 (1.13)         23;0  61.7642 (1.0)          60           1
test_tpch[1-parquet-python-8]       34.7310 (2.32)     52.5183 (3.04)     37.1660 (2.30)     3.2354 (6.61)     36.5454 (2.26)     2.2469 (3.04)          1;1  26.9063 (0.44)         28           1
test_tpch[1-in-memory-native-8]     44.0259 (2.94)     46.0576 (2.66)     45.0905 (2.78)     0.4898 (1.0)      45.0528 (2.79)     0.7380 (1.0)           5;0  22.1776 (0.36)         22           1
test_tpch[1-parquet-native-8]       69.8245 (4.67)     73.1332 (4.23)     71.0333 (4.39)     0.8421 (1.72)     70.8515 (4.39)     0.9827 (1.33)          3;1  14.0779 (0.23)         14           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ----------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-8]      7.2145 (1.0)       8.5953 (1.0)       7.7980 (1.0)      0.3170 (1.0)       7.8074 (1.0)      0.4959 (1.13)         44;0  128.2373 (1.0)         118           1
test_tpch[1-in-memory-python-8]     15.3458 (2.13)     17.6777 (2.06)     16.3485 (2.10)     0.5217 (1.65)     16.5234 (2.12)     0.7077 (1.62)         17;0   61.1678 (0.48)         57           1
test_tpch[1-parquet-native-8]       31.8017 (4.41)     33.5568 (3.90)     32.4955 (4.17)     0.4402 (1.39)     32.4145 (4.15)     0.4376 (1.0)           9;2   30.7735 (0.24)         30           1
test_tpch[1-parquet-python-8]       46.2703 (6.41)     48.3345 (5.62)     47.6753 (6.11)     0.5142 (1.62)     47.7092 (6.11)     0.4509 (1.03)          6;2   20.9752 (0.16)         21           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q19

Before

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-19]     277.9533 (1.0)      281.5647 (1.0)      279.8556 (1.0)       1.4607 (1.0)      279.4186 (1.0)       2.2466 (1.0)           2;0  3.5733 (1.0)           5           1
test_tpch[1-parquet-python-19]       311.1196 (1.12)     317.8069 (1.13)     315.6200 (1.13)      2.6611 (1.82)     316.7545 (1.13)      2.8849 (1.28)          1;0  3.1684 (0.89)          5           1
test_tpch[1-in-memory-native-19]     431.2738 (1.55)     464.2194 (1.65)     442.1488 (1.58)     13.3136 (9.11)     436.8320 (1.56)     16.3197 (7.26)          1;0  2.2617 (0.63)          5           1
test_tpch[1-parquet-native-19]       455.3492 (1.64)     460.8460 (1.64)     458.0333 (1.64)      2.1169 (1.45)     457.4410 (1.64)      3.0005 (1.34)          2;0  2.1832 (0.61)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-19]      62.6100 (1.0)       71.2192 (1.0)       66.0757 (1.0)       2.5634 (1.0)       65.6719 (1.0)      1.5398 (1.0)           6;4  15.1342 (1.0)          15           1
test_tpch[1-parquet-native-19]        94.8984 (1.52)     134.7584 (1.89)     103.1099 (1.56)     12.5712 (4.90)      97.7583 (1.49)     7.3370 (4.76)          1;1   9.6984 (0.64)          9           1
test_tpch[1-in-memory-python-19]     284.6653 (4.55)     295.5558 (4.15)     289.7268 (4.38)      3.9986 (1.56)     288.6982 (4.40)     4.6399 (3.01)          2;0   3.4515 (0.23)          5           1
test_tpch[1-parquet-python-19]       308.9599 (4.93)     319.0440 (4.48)     314.7801 (4.76)      4.2088 (1.64)     315.3274 (4.80)     7.0198 (4.56)          2;0   3.1768 (0.21)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

codspeed-hq · 2024-11-20T21:26:31Z

CodSpeed Performance Report

Merging #3327 will degrade performances by 53.75%

_{Comparing desmondcheongzx:lift-stats-to-logical-plan (5a495ae) with main (4031637)}

Summary

⚡ 2 improvements
❌ 4 regressions
✅ 11 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`main`	`desmondcheongzx:lift-stats-to-logical-plan`	Change
❌	`test_explain[100 Small Files]`	7.3 ms	8.4 ms	-13.11%
❌	`test_iter_rows_first_row[100 Small Files]`	174.9 ms	263.2 ms	-33.53%
❌	`test_show[100 Small Files]`	15.5 ms	33.6 ms	-53.75%
⚡	`test_tpch[1-in-memory-native-5]`	381.9 ms	314.5 ms	+21.43%
⚡	`test_tpch[1-in-memory-native-8]`	339 ms	156.3 ms	×2.2
❌	`test_tpch[1-in-memory-native-9]`	367.2 ms	451 ms	-18.58%

src/daft-local-execution/src/pipeline.rs

src/common/scan-info/src/lib.rs

src/daft-stats/src/partition_spec.rs

src/parquet2/src/metadata/column_chunk_metadata.rs

src/daft-scan/src/lib.rs

codecov · 2024-11-26T02:09:03Z

Codecov Report

Attention: Patch coverage is 83.83743% with 171 lines in your changes missing coverage. Please review.

Project coverage is 76.48%. Comparing base (11c46a9) to head (5a495ae).
Report is 22 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daft-logical-plan/src/stats.rs	73.49%	22 Missing ⚠️
src/daft-local-plan/src/plan.rs	79.22%	16 Missing ⚠️
src/daft-logical-plan/src/ops/concat.rs	51.85%	13 Missing ⚠️
src/daft-scan/src/anonymous.rs	0.00%	10 Missing ⚠️
src/daft-scan/src/lib.rs	81.13%	10 Missing ⚠️
src/daft-logical-plan/src/logical_plan.rs	84.21%	9 Missing ⚠️
src/daft-table/src/lib.rs	0.00%	8 Missing ⚠️
src/common/scan-info/src/lib.rs	46.15%	7 Missing ⚠️
src/daft-logical-plan/src/ops/source.rs	87.93%	7 Missing ⚠️
src/daft-physical-plan/src/plan.rs	60.00%	6 Missing ⚠️
... and 27 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3327      +/-   ##
==========================================
- Coverage   77.35%   76.48%   -0.87%     
==========================================
  Files         684      685       +1     
  Lines       83633    85728    +2095     
==========================================
+ Hits        64693    65568     +875     
- Misses      18940    20160    +1220

Files with missing lines	Coverage Δ
daft/dataframe/dataframe.py	`86.89% <ø> (+0.23%)`	⬆️
src/common/daft-config/src/lib.rs	`81.69% <100.00%> (ø)`
src/common/scan-info/src/scan_operator.rs	`50.00% <ø> (ø)`
src/common/scan-info/src/scan_task.rs	`42.85% <100.00%> (+42.85%)`	⬆️
src/daft-catalog/src/lib.rs	`70.29% <100.00%> (ø)`
src/daft-connect/src/op/execute/root.rs	`95.65% <100.00%> (ø)`
...local-execution/src/sinks/outer_hash_join_probe.rs	`96.96% <100.00%> (+0.13%)`	⬆️
src/daft-logical-plan/src/builder.rs	`91.66% <100.00%> (ø)`
src/daft-logical-plan/src/display.rs	`98.06% <100.00%> (ø)`
src/daft-logical-plan/src/lib.rs	`100.00% <ø> (ø)`
... and 48 more

... and 45 files with indirect coverage changes

daft/runners/native_runner.py

daft/dataframe/dataframe.py

src/daft-local-execution/src/pipeline.rs

samster25 · 2024-11-26T09:14:37Z

src/daft-logical-plan/src/ops/source.rs

+                let scan_tasks = match &physical_scan_info.scan_state {
+                    ScanState::Operator(scan_op) => scan_op
+                        .0
+                        .to_scan_tasks(physical_scan_info.pushdowns.clone(), execution_config)?,


I think we should avoid execution_config completely in the logical planner. This would also mean that the logical plan needs to know what execution engine we are targeting which breaks the invariant of the logical plan.

Right now to_scan_tasks performs generates the scan tasks and then also run the scan task passes. But the implementation of that looks identical for the 3 ScanOperator impls that we have.

My proposal is that we remove the scan task passes from the to_scan_tasks for GlobScanOperator, AnonymousScanOperator and the PythonScanOperator and then run it during translation.

Sounds good!

I did this, with a few caveats:

At translation time, we have already converted the concrete scan tasks into Arc<Vec<Arc<dyn ScanTaskLike>>>

I implemented ScanTaskLike.split() as a trait method, then modified split_by_row_groups to use this trait method on our trait objects

Unfortunately I could not do the same for merging because merging requires some data source and partitioning information that, if pulled into ScanTaskLike would create circular dependencies.

Instead, since the only non-test implementer of ScanTaskLike is ScanTask, for merging I downcasted the Arc<dyn ScanTaskLike>s into Arc<ScanTask>, then kept the old implementation of merging.

Please let me know if this works for you or if it's a little sus!

Left some feedback on this! Happy to pair up tomorrow to crank this out

src/daft-physical-plan/src/physical_planner/translate.rs

src/common/scan-info/src/scan_task.rs

src/daft-scan/src/scan_task_iters.rs

desmondcheongzx force-pushed the lift-stats-to-logical-plan branch 20 times, most recently from d399c3d to acce197 Compare November 20, 2024 06:07

desmondcheongzx changed the title ~~[DRAFT] Lift stats to logical plan~~ [PERF] Improve hash table probe side decisions for Swordfish Nov 20, 2024

github-actions bot added the performance (legacy) Please use the "perf" label instead label Nov 20, 2024

desmondcheongzx force-pushed the lift-stats-to-logical-plan branch from acce197 to 4d6d6d6 Compare November 20, 2024 21:19

Implement

9510233

desmondcheongzx force-pushed the lift-stats-to-logical-plan branch from 4d6d6d6 to 9510233 Compare November 20, 2024 22:25

desmondcheongzx requested review from samster25, kevinzwang and colin-ho November 20, 2024 22:30

Display stats when explaining optimized logical plan

4543a5b

desmondcheongzx added 3 commits November 23, 2024 01:18

Clarify two pass join algorithm

68324a9

Add probe side size threshold

5f3714b

Add execution config to py logical plan builder

4a75b1b

samster25 reviewed Nov 23, 2024

View reviewed changes

desmondcheongzx added 10 commits November 25, 2024 10:27

Merge remote-tracking branch 'daft' into lift-stats-to-logical-plan

31ef97a

Ignore execution config for local execution

24db81a

Change variable name for left_smaller_than_right

987a028

Remove changes to parquet2

c57a787

Clean up hash for partition spec

5eafa86

Use Arcs to hold vectors of scan task like refs

5b7829b

Pass execution config directly on optimize

291fad4

No default arguments for you

7ec69c0

Properly pass up errors on materialize scan failures

2c8f17e

Only hash hashable pyobjects

f34e59c

samster25 reviewed Nov 26, 2024

View reviewed changes

desmondcheongzx force-pushed the lift-stats-to-logical-plan branch 2 times, most recently from a997839 to 72d6f34 Compare November 27, 2024 04:37

Move scan task splitting and merging to physical translation time

152c8d4

desmondcheongzx force-pushed the lift-stats-to-logical-plan branch from 72d6f34 to 152c8d4 Compare November 27, 2024 04:44

Reuse the broadcast join size threshold

15f51d7

samster25 reviewed Nov 27, 2024

View reviewed changes

src/common/scan-info/src/scan_task.rs Show resolved Hide resolved

src/common/scan-info/src/scan_task.rs Outdated Show resolved Hide resolved

desmondcheongzx removed the request for review from kevinzwang November 27, 2024 23:01

desmondcheongzx added 2 commits November 27, 2024 15:59

Remove daft-scan dependency from daft-physical-plan

e6f14ce

Forward declare scan task split and merge

55382f3

samster25 approved these changes Nov 28, 2024

View reviewed changes

src/daft-scan/src/scan_task_iters.rs Outdated Show resolved Hide resolved

Remove is_scan_task()

5a495ae

desmondcheongzx enabled auto-merge (squash) November 28, 2024 01:17

desmondcheongzx merged commit f2d4f73 into Eventual-Inc:main Nov 28, 2024
39 of 40 checks passed

desmondcheongzx deleted the lift-stats-to-logical-plan branch November 28, 2024 01:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PERF] Improve hash table probe side decisions for Swordfish #3327

[PERF] Improve hash table probe side decisions for Swordfish #3327

desmondcheongzx commented Nov 18, 2024 •

edited

Loading

codspeed-hq bot commented Nov 20, 2024 •

edited

Loading

codecov bot commented Nov 26, 2024 •

edited

Loading

samster25 Nov 26, 2024

desmondcheongzx Nov 27, 2024

samster25 Nov 27, 2024

[PERF] Improve hash table probe side decisions for Swordfish #3327

[PERF] Improve hash table probe side decisions for Swordfish #3327

Conversation

desmondcheongzx commented Nov 18, 2024 • edited Loading

Benchmark results

Q5

Before

After

Q8

Before

After

Q19

Before

After

codspeed-hq bot commented Nov 20, 2024 • edited Loading

CodSpeed Performance Report

Merging #3327 will degrade performances by 53.75%

Summary

Benchmarks breakdown

codecov bot commented Nov 26, 2024 • edited Loading

Codecov Report

samster25 Nov 26, 2024

Choose a reason for hiding this comment

desmondcheongzx Nov 27, 2024

Choose a reason for hiding this comment

samster25 Nov 27, 2024

Choose a reason for hiding this comment

desmondcheongzx commented Nov 18, 2024 •

edited

Loading

codspeed-hq bot commented Nov 20, 2024 •

edited

Loading

codecov bot commented Nov 26, 2024 •

edited

Loading