Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Improve hash table probe side decisions for Swordfish #3327

Merged

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Nov 18, 2024

This PR lifts statistics into optimized logical plans so that they're available for local execution plans. It then uses these newly available statistics to make better decisions on whether to build the probe table of a hash join on the left or right side.

Benchmark results

For TPC-H, this gives us some notable speedups with Q5, Q8, and Q19.

  • Q5: ~2.2x (in memory), ~1.5 (parquet)
  • Q8: ~5.9x (in memory), ~2.2x (parquet)
  • Q19: ~7x (in memory), ~4.5x (parquet)

Crucially, with this change, our native runner is now faster (or within some small deviation) than our previous python runner for all 22 TPC-H queries.

For more detailed results, we have:

Q5

Before

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-5]     28.7213 (1.0)      30.3664 (1.0)      29.4645 (1.0)      0.3849 (1.0)      29.4894 (1.0)      0.5529 (1.0)          10;0  33.9391 (1.0)          33           1
test_tpch[1-in-memory-native-5]     30.9980 (1.08)     34.1489 (1.12)     32.2151 (1.09)     0.7150 (1.86)     32.2903 (1.09)     1.1586 (2.10)         10;0  31.0413 (0.91)         31           1
test_tpch[1-parquet-python-5]       48.8010 (1.70)     51.8535 (1.71)     50.0985 (1.70)     0.8342 (2.17)     50.0400 (1.70)     1.4193 (2.57)          9;0  19.9607 (0.59)         20           1
test_tpch[1-parquet-native-5]       51.1122 (1.78)     54.0755 (1.78)     52.3799 (1.78)     0.8317 (2.16)     52.4268 (1.78)     1.2526 (2.27)          8;0  19.0913 (0.56)         20           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q5-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-5]     13.8393 (1.0)      16.2712 (1.0)      14.4352 (1.0)      0.5322 (2.13)     14.3212 (1.0)      0.5084 (1.71)          9;5  69.2750 (1.0)          62           1
test_tpch[1-in-memory-python-5]     28.3915 (2.05)     29.5304 (1.81)     28.8477 (2.00)     0.2501 (1.0)      28.8299 (2.01)     0.2971 (1.0)          10;1  34.6649 (0.50)         35           1
test_tpch[1-parquet-native-5]       34.3952 (2.49)     36.4230 (2.24)     35.4314 (2.45)     0.4543 (1.82)     35.3678 (2.47)     0.5042 (1.70)          8;1  28.2235 (0.41)         28           1
test_tpch[1-parquet-python-5]       55.7339 (4.03)     57.3051 (3.52)     56.4564 (3.91)     0.5101 (2.04)     56.2507 (3.93)     0.7570 (2.55)          4;0  17.7128 (0.26)         18           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q8

Before

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ---------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-8]     14.9532 (1.0)      17.2992 (1.0)      16.1906 (1.0)      0.5962 (1.22)     16.1370 (1.0)      0.8323 (1.13)         23;0  61.7642 (1.0)          60           1
test_tpch[1-parquet-python-8]       34.7310 (2.32)     52.5183 (3.04)     37.1660 (2.30)     3.2354 (6.61)     36.5454 (2.26)     2.2469 (3.04)          1;1  26.9063 (0.44)         28           1
test_tpch[1-in-memory-native-8]     44.0259 (2.94)     46.0576 (2.66)     45.0905 (2.78)     0.4898 (1.0)      45.0528 (2.79)     0.7380 (1.0)           5;0  22.1776 (0.36)         22           1
test_tpch[1-parquet-native-8]       69.8245 (4.67)     73.1332 (4.23)     71.0333 (4.39)     0.8421 (1.72)     70.8515 (4.39)     0.9827 (1.33)          3;1  14.0779 (0.23)         14           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

--------------------------------------------------------------------------------- benchmark 'q8-parts-1': 4 tests ----------------------------------------------------------------------------------
Name (time in ms)                       Min                Max               Mean            StdDev             Median               IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-8]      7.2145 (1.0)       8.5953 (1.0)       7.7980 (1.0)      0.3170 (1.0)       7.8074 (1.0)      0.4959 (1.13)         44;0  128.2373 (1.0)         118           1
test_tpch[1-in-memory-python-8]     15.3458 (2.13)     17.6777 (2.06)     16.3485 (2.10)     0.5217 (1.65)     16.5234 (2.12)     0.7077 (1.62)         17;0   61.1678 (0.48)         57           1
test_tpch[1-parquet-native-8]       31.8017 (4.41)     33.5568 (3.90)     32.4955 (4.17)     0.4402 (1.39)     32.4145 (4.15)     0.4376 (1.0)           9;2   30.7735 (0.24)         30           1
test_tpch[1-parquet-python-8]       46.2703 (6.41)     48.3345 (5.62)     47.6753 (6.11)     0.5142 (1.62)     47.7092 (6.11)     0.4509 (1.03)          6;2   20.9752 (0.16)         21           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q19

Before

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-python-19]     277.9533 (1.0)      281.5647 (1.0)      279.8556 (1.0)       1.4607 (1.0)      279.4186 (1.0)       2.2466 (1.0)           2;0  3.5733 (1.0)           5           1
test_tpch[1-parquet-python-19]       311.1196 (1.12)     317.8069 (1.13)     315.6200 (1.13)      2.6611 (1.82)     316.7545 (1.13)      2.8849 (1.28)          1;0  3.1684 (0.89)          5           1
test_tpch[1-in-memory-native-19]     431.2738 (1.55)     464.2194 (1.65)     442.1488 (1.58)     13.3136 (9.11)     436.8320 (1.56)     16.3197 (7.26)          1;0  2.2617 (0.63)          5           1
test_tpch[1-parquet-native-19]       455.3492 (1.64)     460.8460 (1.64)     458.0333 (1.64)      2.1169 (1.45)     457.4410 (1.64)      3.0005 (1.34)          2;0  2.1832 (0.61)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

After

------------------------------------------------------------------------------------ benchmark 'q19-parts-1': 4 tests -----------------------------------------------------------------------------------
Name (time in ms)                         Min                 Max                Mean             StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_tpch[1-in-memory-native-19]      62.6100 (1.0)       71.2192 (1.0)       66.0757 (1.0)       2.5634 (1.0)       65.6719 (1.0)      1.5398 (1.0)           6;4  15.1342 (1.0)          15           1
test_tpch[1-parquet-native-19]        94.8984 (1.52)     134.7584 (1.89)     103.1099 (1.56)     12.5712 (4.90)      97.7583 (1.49)     7.3370 (4.76)          1;1   9.6984 (0.64)          9           1
test_tpch[1-in-memory-python-19]     284.6653 (4.55)     295.5558 (4.15)     289.7268 (4.38)      3.9986 (1.56)     288.6982 (4.40)     4.6399 (3.01)          2;0   3.4515 (0.23)          5           1
test_tpch[1-parquet-python-19]       308.9599 (4.93)     319.0440 (4.48)     314.7801 (4.76)      4.2088 (1.64)     315.3274 (4.80)     7.0198 (4.56)          2;0   3.1768 (0.21)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@desmondcheongzx desmondcheongzx force-pushed the lift-stats-to-logical-plan branch 20 times, most recently from d399c3d to acce197 Compare November 20, 2024 06:07
@desmondcheongzx desmondcheongzx changed the title [DRAFT] Lift stats to logical plan [PERF] Improve hash table probe side decisions for Swordfish Nov 20, 2024
@github-actions github-actions bot added the performance (legacy) Please use the "perf" label instead label Nov 20, 2024
@desmondcheongzx desmondcheongzx force-pushed the lift-stats-to-logical-plan branch from acce197 to 4d6d6d6 Compare November 20, 2024 21:19
Copy link

codspeed-hq bot commented Nov 20, 2024

CodSpeed Performance Report

Merging #3327 will degrade performances by 53.75%

Comparing desmondcheongzx:lift-stats-to-logical-plan (5a495ae) with main (4031637)

Summary

⚡ 2 improvements
❌ 4 regressions
✅ 11 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main desmondcheongzx:lift-stats-to-logical-plan Change
test_explain[100 Small Files] 7.3 ms 8.4 ms -13.11%
test_iter_rows_first_row[100 Small Files] 174.9 ms 263.2 ms -33.53%
test_show[100 Small Files] 15.5 ms 33.6 ms -53.75%
test_tpch[1-in-memory-native-5] 381.9 ms 314.5 ms +21.43%
test_tpch[1-in-memory-native-8] 339 ms 156.3 ms ×2.2
test_tpch[1-in-memory-native-9] 367.2 ms 451 ms -18.58%

src/daft-local-execution/src/pipeline.rs Outdated Show resolved Hide resolved
src/common/scan-info/src/lib.rs Outdated Show resolved Hide resolved
src/daft-stats/src/partition_spec.rs Outdated Show resolved Hide resolved
src/parquet2/src/metadata/column_chunk_metadata.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 83.83743% with 171 lines in your changes missing coverage. Please review.

Project coverage is 76.48%. Comparing base (11c46a9) to head (5a495ae).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-logical-plan/src/stats.rs 73.49% 22 Missing ⚠️
src/daft-local-plan/src/plan.rs 79.22% 16 Missing ⚠️
src/daft-logical-plan/src/ops/concat.rs 51.85% 13 Missing ⚠️
src/daft-scan/src/anonymous.rs 0.00% 10 Missing ⚠️
src/daft-scan/src/lib.rs 81.13% 10 Missing ⚠️
src/daft-logical-plan/src/logical_plan.rs 84.21% 9 Missing ⚠️
src/daft-table/src/lib.rs 0.00% 8 Missing ⚠️
src/common/scan-info/src/lib.rs 46.15% 7 Missing ⚠️
src/daft-logical-plan/src/ops/source.rs 87.93% 7 Missing ⚠️
src/daft-physical-plan/src/plan.rs 60.00% 6 Missing ⚠️
... and 27 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3327      +/-   ##
==========================================
- Coverage   77.35%   76.48%   -0.87%     
==========================================
  Files         684      685       +1     
  Lines       83633    85728    +2095     
==========================================
+ Hits        64693    65568     +875     
- Misses      18940    20160    +1220     
Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 86.89% <ø> (+0.23%) ⬆️
src/common/daft-config/src/lib.rs 81.69% <100.00%> (ø)
src/common/scan-info/src/scan_operator.rs 50.00% <ø> (ø)
src/common/scan-info/src/scan_task.rs 42.85% <100.00%> (+42.85%) ⬆️
src/daft-catalog/src/lib.rs 70.29% <100.00%> (ø)
src/daft-connect/src/op/execute/root.rs 95.65% <100.00%> (ø)
...local-execution/src/sinks/outer_hash_join_probe.rs 96.96% <100.00%> (+0.13%) ⬆️
src/daft-logical-plan/src/builder.rs 91.66% <100.00%> (ø)
src/daft-logical-plan/src/display.rs 98.06% <100.00%> (ø)
src/daft-logical-plan/src/lib.rs 100.00% <ø> (ø)
... and 48 more

... and 45 files with indirect coverage changes

daft/runners/native_runner.py Outdated Show resolved Hide resolved
daft/dataframe/dataframe.py Outdated Show resolved Hide resolved
src/daft-local-execution/src/pipeline.rs Outdated Show resolved Hide resolved
let scan_tasks = match &physical_scan_info.scan_state {
ScanState::Operator(scan_op) => scan_op
.0
.to_scan_tasks(physical_scan_info.pushdowns.clone(), execution_config)?,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should avoid execution_config completely in the logical planner. This would also mean that the logical plan needs to know what execution engine we are targeting which breaks the invariant of the logical plan.

Right now to_scan_tasks performs generates the scan tasks and then also run the scan task passes. But the implementation of that looks identical for the 3 ScanOperator impls that we have.

My proposal is that we remove the scan task passes from the to_scan_tasks for GlobScanOperator, AnonymousScanOperator and the PythonScanOperator and then run it during translation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

I did this, with a few caveats:

  • At translation time, we have already converted the concrete scan tasks into Arc<Vec<Arc<dyn ScanTaskLike>>>
  • I implemented ScanTaskLike.split() as a trait method, then modified split_by_row_groups to use this trait method on our trait objects
  • Unfortunately I could not do the same for merging because merging requires some data source and partitioning information that, if pulled into ScanTaskLike would create circular dependencies.
  • Instead, since the only non-test implementer of ScanTaskLike is ScanTask, for merging I downcasted the Arc<dyn ScanTaskLike>s into Arc<ScanTask>, then kept the old implementation of merging.

Please let me know if this works for you or if it's a little sus!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some feedback on this! Happy to pair up tomorrow to crank this out

@desmondcheongzx desmondcheongzx force-pushed the lift-stats-to-logical-plan branch 2 times, most recently from a997839 to 72d6f34 Compare November 27, 2024 04:37
@desmondcheongzx desmondcheongzx force-pushed the lift-stats-to-logical-plan branch from 72d6f34 to 152c8d4 Compare November 27, 2024 04:44
@desmondcheongzx desmondcheongzx removed the request for review from kevinzwang November 27, 2024 23:01
src/daft-scan/src/scan_task_iters.rs Outdated Show resolved Hide resolved
@desmondcheongzx desmondcheongzx enabled auto-merge (squash) November 28, 2024 01:17
@desmondcheongzx desmondcheongzx merged commit f2d4f73 into Eventual-Inc:main Nov 28, 2024
39 of 40 checks passed
@desmondcheongzx desmondcheongzx deleted the lift-stats-to-logical-plan branch November 28, 2024 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance (legacy) Please use the "perf" label instead
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants