Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Common Subexpression Elimination for PhysicalExpr trees #13046

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Oct 21, 2024

Which issue does this PR close?

Part of #12599.

Rationale for this change

As described in #12599, there is a CSE rule for logical plans already, but some projects create physical plans directly that could benefit from physical CSE.

What changes are included in this PR?

This PR:

  • Adds EliminateCommonPhysicalSubexprs rule to eliminate common subtrees for Arc<dyn PhysicalExpr> trees. This initial implementation targets ProjectionExec nodes only. Follow-up PR can add support for other nodes like aggregates and filters.
  • Adds DynHashNode trait and implements it for PhysicalExprs.
  • Contains some code cleanup in CommonSubexprEliminate rule.

Are these changes tested?

Added new UTs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Oct 21, 2024
@peter-toth
Copy link
Contributor Author

cc @alamb, @andygrove

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Oct 21, 2024
@andygrove andygrove changed the title Add Common Subexpression Elimination for PsysicalExpr trees Add Common Subexpression Elimination for PhysicalExpr trees Oct 22, 2024
@andygrove
Copy link
Member

Thanks @peter-toth. I plan on testing this out with Comet.

// The EliminateCommonPhysicalSubExprs rule extracts common physical
// subexpression trees into a `ProjectionExec` node under the actual node to
// calculate the common values only once.
Arc::new(EliminateCommonPhysicalSubexprs::new()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any examples of where this rule does eliminations for a plan that has been optimized by the corresponding logical rule?

Because if not it not clear to me why it should be part of the default/recommended list of rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the usecase is for systems (like Comet) that don't use the LogicalPlanner.

That being said, I think the question of "should it be run by default" is a good one -- and I agree with your assertion if we can't find an example where this helps a plan we should probably not enable it by default.

FYI @andygrove

Copy link
Contributor Author

@peter-toth peter-toth Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this rule is useful only for Comet and similar usecases. Let me remove it from the default optimizer: d2529ce

@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Oct 23, 2024
type CommonNodes<'n, N> = IndexMap<Identifier<'n, N>, (N, String)>;
/// A list that contains the common [`TreeNode`]s and their alias, extracted during the
/// second, rewriting traversal.
type CommonNodes<'n, N> = Vec<(N, String)>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this change of type from IndexMap to Vec in CommonNodes is that physical columns works with indexes rather than names. E.g. when we repace a common subexpression to a column during rewrite, we need both the name and the index of the common subexpression in the intermediate ProjectionExec node. Storing the index in NodeStats and using Vec in CommonNodes better fits this usecase.

@github-actions github-actions bot removed the core Core DataFusion crate label Nov 6, 2024
@alamb
Copy link
Contributor

alamb commented Nov 27, 2024

@andygrove -- since you asked for this feature, can you help figure out what to do with this PR?

@peter-toth peter-toth force-pushed the physicalexpr-cse branch 2 times, most recently from 2d27ff7 to d6fbb30 Compare November 28, 2024 08:20
@peter-toth
Copy link
Contributor Author

@andygrove , I've rebased my PR on latest main if you want to give it a try.

@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

@andygrove are you still interested in this feature / reviewing this PR?

@alamb alamb marked this pull request as draft January 29, 2025 15:12
@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

Marking as draft as I am not sure what is happeing with this PR and I am trying to work down the review queue

@andygrove
Copy link
Member

@andygrove are you still interested in this feature / reviewing this PR?

@peter-toth @alamb Apologies, I am still interested in this but other priorities came up and I have not had time to review. I will try and get to this soon.

@peter-toth
Copy link
Contributor Author

@andygrove, no problem, I will try to update this PR from main soon.

@peter-toth peter-toth force-pushed the physicalexpr-cse branch 3 times, most recently from 921f417 to eee2ee5 Compare February 9, 2025 15:00
@peter-toth
Copy link
Contributor Author

peter-toth commented Feb 9, 2025

@andygrove , I've updated the PR from main.
Please note that the new EliminateCommonPhysicalSubexprs rule is not part of the default PhysicalOptimizer as the rule has no use in standard DataFusion. If you would like to use this rule in Comet then you explicitely need to add the rule the optimizer.
Also, this is a simplified version of the logical rule, it handles ProjectExec nodes only and doesn't normalize expressions during elimination (e.g. ProjectExec [a + 1, 1 + a] will not be eliminated for now). But if you think the rule is useful I'm happy to port all the features of the logical rule in follow-up PRs.

@peter-toth peter-toth marked this pull request as ready for review February 9, 2025 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants