Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize GroupBy expressions with no aggregate functions. #8779

Open
nicktobey opened this issue Jan 22, 2025 · 0 comments
Open

Optimize GroupBy expressions with no aggregate functions. #8779

nicktobey opened this issue Jan 22, 2025 · 0 comments
Labels
performance sql Issue with SQL

Comments

@nicktobey
Copy link
Contributor

We have a special optimization for SELECT COUNT(*) FROM table_name, which results in the following plan:

Project
  | ├─ columns: [count(1)]
  | └─ Project
  | ├─ columns: [table_name.COUNT(1) as COUNT(1)]
  | └─ table_count(table_name) as COUNT(1)

We're able to compute this efficiently because each tree node stores the total number of leaf nodes under it (aka its "TreeCount"), so we can simply read the TreeCount out of the table's root node.

However, we don't apply similar optimizations for related queries:

Counting with an index prefix

DESCRIBE PLAN SELECT COUNT(*) FROM table_name where pk1 = 1

Project
  | ├─ columns: [count(1)]
  | └─ GroupBy
  | ├─ SelectedExprs(COUNT(1))
  | ├─ Grouping()
  | └─ Filter
  | ├─ (table_name.pk1 = 1)
  | └─ IndexedTableAccess(table_name)
  | ├─ index: [table_name.pk1,table_name.pk2,table_name.pk3,table_name.pk4]
  | ├─ filters: [{[1, 1], [NULL, ∞), [NULL, ∞), [NULL, ∞)}]
  | └─ columns: [pk1]

We end up iterating over every matching row in order to count them, but this is excessive: if an intermediate node has all of its children within the range, we should be able to just add that node's TreeCount instead.

Counting groups

DESCRIBE PLAN SELECT pk1 FROM filenames GROUP BY (pk1`);

GroupBy
  | ├─ SelectedExprs(table_name.pk1)
  | ├─ Grouping(table_name.pk1)
  | └─ Table
  | ├─ name: filenames
  | └─ columns: [pk1]

It's less obvious from looking at the plan that we're still iterating over every row in the table, but we are. This is also excessive: we're trying to count the number of different values of pk1, but if an intermediate node only has children with a single value, we should be able to skip walking that node.

The Fix?

In both cases, evaluating the query is unnecessarily slow if there are a large number of rows where the value being grouped / filtered on is the same. This is because we're iterating over entire sections of the table, even though we don't actually care about any columns that aren't in the key prefix. Those values are only usable in aggregate functions, but there are no aggregate functions in the query.

In general, if a query has an explicit or implicit GROUP BY, but there are no aggregate functions, then it should be possible to compute the result by doing an intelligent walk on the table's tree structure, without needing to iterate over the individual table rows.

@nicktobey nicktobey added the sql Issue with SQL label Jan 22, 2025
@nicktobey nicktobey changed the title Avoid iterating over entire table in GroupBy expressions when not needed. Optimize GroupBy expressions with no aggregate functions. Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance sql Issue with SQL
Projects
None yet
Development

No branches or pull requests

2 participants