feat(optimizer): Implement greedy join reordering #3538
Open. desmondcheongzx wants to merge 8 commits into Eventual-Inc:main from desmondcheongzx:goo-join-order
3d5736a Implement GOO (desmondcheongzx)
8207e1e Apply projections and filters (desmondcheongzx)
8873626 Reimplement num_edges() (desmondcheongzx)
d450ebf Reimplement fully_connected() (desmondcheongzx)
9f33df3 Sweatin' (desmondcheongzx)
9b7886a Update tests (desmondcheongzx)
ba6eee2 Cleanup (desmondcheongzx)
c77c3d1 Cleanup (desmondcheongzx)
src/daft-logical-plan/src/optimization/rules/reorder_joins/greedy_join_order.rs (153 changes: 153 additions & 0 deletions)
@@ -0,0 +1,153 @@
use std::{collections::HashMap, sync::Arc};

use common_error::DaftResult;
use daft_dsl::{col, ExprRef};

use super::join_graph::{JoinCondition, JoinGraph};
use crate::{LogicalPlanBuilder, LogicalPlanRef};

// This is an implementation of the Greedy Operator Ordering algorithm (GOO) [1] for join selection. This algorithm
// selects join edges greedily by picking the edge with the smallest cost at each step. This is similar to Kruskal's
// minimum spanning tree algorithm, with the caveat that edge costs update at each step, due to changing cardinalities
// and selectivities between join nodes.
//
// Compared to DP-based algorithms, GOO is not always optimal. However, GOO has a complexity of O(n^3) and is more viable
// than DP-based algorithms when performing join ordering on many relations. DP Connected subgraph Complement Pairs (DPccp) [2]
// is the DP-based algorithm widely used in database systems today and has an O(3^n) complexity. The latest literature
// does offer a super-polynomially faster DP algorithm, but it still has an O(2^n) to O(2^n * n^3) complexity [3].
//
// For this reason, we maintain a greedy join ordering algorithm to use when the number of relations is large, and default
// to DP-based algorithms otherwise.
//
// [1]: Fegaras, L. (1998). A New Heuristic for Optimizing Large Queries. International Conference on Database and Expert Systems Applications.
// [2]: Moerkotte, G., & Neumann, T. (2006). Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products. Very Large Data Bases Conference.
// [3]: Stoian, M., & Kipf, A. (2024). DPconv: Super-Polynomially Faster Join Ordering. ArXiv, abs/2409.08013.
pub(crate) struct GreedyJoinOrderer {}

impl GreedyJoinOrderer {
    /// Consumes the join graph and transforms it into a logical plan with joins reordered.
    pub(crate) fn compute_join_order(join_graph: &mut JoinGraph) -> DaftResult<LogicalPlanRef> {
        // While the join graph consists of more than one join node, select the edge that has the smallest cost,
        // then join the left and right nodes connected by this edge.
        while join_graph.adj_list.0.len() > 1 {
            let selected_pair = GreedyJoinOrderer::find_minimum_cost_join(&join_graph.adj_list.0);
            if let Some((left, right, join_conds)) = selected_pair {
                // Join the left and right relations using the given join conditions.
                let (left_on, right_on) = join_conds
                    .iter()
                    .map(|join_cond| {
                        (
                            col(join_cond.left_on.clone()),
                            col(join_cond.right_on.clone()),
                        )
                    })
                    .collect::<(Vec<ExprRef>, Vec<ExprRef>)>();
                let left_builder = LogicalPlanBuilder::from(left.clone());
                let join = left_builder
                    .inner_join(right.clone(), left_on, right_on)?
                    .build();
                let join = Arc::new(Arc::unwrap_or_clone(join).with_materialized_stats());

                // Add the new node into the adjacency list.
                let left_neighbors = join_graph.adj_list.0.remove(&left).unwrap();
                let right_neighbors = join_graph.adj_list.0.remove(&right).unwrap();
                let mut new_join_edges = HashMap::new();

                // Helper function that takes in neighbors to the left and right nodes, then combines edges that point
                // back to the left and/or right nodes into edges that point to the new join node.
                let mut update_neighbors =
                    |neighbors: HashMap<LogicalPlanRef, Vec<JoinCondition>>| {
                        for (neighbor, _) in neighbors {
                            if neighbor == right || neighbor == left {
                                // Skip the nodes that we just joined.
                                continue;
                            }
                            let mut join_conditions = Vec::new();
                            // If this neighbor was connected to left or right nodes, collect the join conditions.
                            let neighbor_edges = join_graph
                                .adj_list
                                .0
                                .get_mut(&neighbor)
                                .expect("The neighbor should still be in the join graph");
                            if let Some(left_conds) = neighbor_edges.remove(&left) {
                                join_conditions.extend(left_conds);
                            }
                            if let Some(right_conds) = neighbor_edges.remove(&right) {
                                join_conditions.extend(right_conds);
                            }
                            // If this neighbor had any connections to left or right, create a new edge to the new join node.
                            if !join_conditions.is_empty() {
                                neighbor_edges.insert(join.clone(), join_conditions.clone());
                                new_join_edges.insert(
                                    neighbor.clone(),
                                    join_conditions.iter().map(|cond| cond.flip()).collect(),
                                );
                            }
                        }
                    };

                // Process all neighbors from both the left and right sides.
                update_neighbors(left_neighbors);
                update_neighbors(right_neighbors);

                // Add the new join node and its edges to the graph.
                join_graph.adj_list.0.insert(join, new_join_edges);
            } else {
                panic!(
                    "No valid join edge selected despite join graph containing more than one relation"
                );
            }
        }
        // Apply projections and filters on top of the fully joined plan.
        if let Some(joined_plan) = join_graph.adj_list.0.drain().map(|(plan, _)| plan).last() {
            join_graph.apply_projections_and_filters_to_plan(joined_plan)
        } else {
            panic!("No valid logical plan after join reordering")
        }
    }

    /// Helper function that finds the next join edge in the adjacency list that has the smallest cost.
    /// Currently cost is determined based on the max size in bytes of the candidate left and right relations.
    fn find_minimum_cost_join(
        adj_list: &HashMap<LogicalPlanRef, HashMap<LogicalPlanRef, Vec<JoinCondition>>>,
    ) -> Option<(LogicalPlanRef, LogicalPlanRef, Vec<JoinCondition>)> {
        let mut min_cost = None;
        let mut selected_pair = None;

        for (candidate_left, neighbors) in adj_list {
            for (candidate_right, join_conds) in neighbors {
                let left_stats = candidate_left.materialized_stats();
                let right_stats = candidate_right.materialized_stats();

                // Assume primary key foreign key join which would have a size bounded by the foreign key relation,
                // which is typically larger.
                let cur_cost = left_stats
                    .approx_stats
                    .upper_bound_bytes
                    .max(right_stats.approx_stats.upper_bound_bytes);

                if let Some(existing_min) = min_cost {
                    if let Some(current) = cur_cost {
                        if current < existing_min {
                            min_cost = Some(current);
                            selected_pair = Some((
                                candidate_left.clone(),
                                candidate_right.clone(),
                                join_conds.clone(),
                            ));
                        }
                    }
                } else {
                    min_cost = cur_cost;
                    selected_pair = Some((
                        candidate_left.clone(),
                        candidate_right.clone(),
                        join_conds.clone(),
                    ));
                }
            }
        }

        selected_pair
    }
}
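
For readers who want to see the greedy loop without Daft's types, here is a minimal, self-contained sketch of the same idea on a toy join graph. It is not the PR's implementation: `Plan`, `Node`, and `greedy_join_order` are hypothetical names, relations are identified by plain integer ids, and the cost of an edge is assumed to be the larger of the two input sizes in bytes, mirroring the heuristic described in the diff. Unlike the PR, the sketch returns `None` for an empty or disconnected graph instead of panicking.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy join tree: a base relation or an inner join of two subtrees.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(&'static str),
    Join(Box<Plan>, Box<Plan>),
}

/// A node in the toy join graph: the plan built so far and its estimated size.
#[derive(Debug, Clone)]
struct Node {
    plan: Plan,
    size_bytes: u64,
}

/// Repeatedly joins the cheapest connected pair until one node remains.
/// Assumes every edge endpoint is a key in `nodes` and there are no self-loops.
/// Returns `None` if the graph is empty or not connected.
fn greedy_join_order(
    mut nodes: BTreeMap<usize, Node>,
    edges: impl IntoIterator<Item = (usize, usize)>,
) -> Option<Plan> {
    // Store each undirected edge once, as (smaller id, larger id).
    let mut edges: BTreeSet<(usize, usize)> = edges
        .into_iter()
        .map(|(a, b)| (a.min(b), a.max(b)))
        .collect();
    let mut next_id = *nodes.keys().max()? + 1;

    while nodes.len() > 1 {
        // Edge cost (an assumption of this sketch): the larger input size.
        let &(l, r) = edges
            .iter()
            .min_by_key(|(a, b)| nodes[a].size_bytes.max(nodes[b].size_bytes))?;
        let left = nodes.remove(&l)?;
        let right = nodes.remove(&r)?;

        // Assume a primary key / foreign key join: output bounded by the larger side.
        let joined = Node {
            size_bytes: left.size_bytes.max(right.size_bytes),
            plan: Plan::Join(Box::new(left.plan), Box::new(right.plan)),
        };

        // Drop the edge we just used and redirect edges that touched
        // either input so they point at the new join node.
        edges = edges
            .into_iter()
            .filter(|&e| e != (l, r))
            .map(|(a, b)| {
                let a = if a == l || a == r { next_id } else { a };
                let b = if b == l || b == r { next_id } else { b };
                (a.min(b), a.max(b))
            })
            .collect();

        nodes.insert(next_id, joined);
        next_id += 1;
    }

    nodes.into_values().next().map(|node| node.plan)
}
```

With three relations `a` (1000 bytes), `b` (10 bytes), and `c` (100 bytes) connected as a chain `a - b - c`, the sketch joins `b` with `c` first because that edge has the smaller cost, then joins `a` with the result. Because the recomputed size of a joined node feeds into later edge costs, the loop cannot simply sort the edges once up front, which is the difference from Kruskal's algorithm noted in the file's header comment.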
I think this is doing too much (reordering and creating the logical plan) and isn't really that modular.
Instead what I would recommend is the following:
This way you get two things: