We have a special optimization for `SELECT COUNT(*) FROM table_name`, which results in the following plan:
```
Project
 ├─ columns: [count(1)]
 └─ Project
     ├─ columns: [table_name.COUNT(1) as COUNT(1)]
     └─ table_count(table_name) as COUNT(1)
```
We're able to compute this efficiently because each tree node stores the total number of leaf nodes under it (aka its "TreeCount"), so we can simply read the TreeCount out of the table's root node.
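To make that concrete, here's a minimal sketch in Go (the field names and node layout are hypothetical stand-ins, not the engine's actual storage types): because every node caches the number of leaf entries beneath it, a bare COUNT(*) is a single field read on the root.

```go
// Hypothetical tree node that caches the number of leaf entries (rows)
// beneath it. Illustrative only; not the engine's real node format.
type node struct {
	keys      [][]byte // ordered keys; for a leaf, one key per row
	children  []*node  // nil for a leaf
	treeCount int      // total number of leaf entries under this node
}

// tableRowCount answers SELECT COUNT(*) FROM t without visiting any rows:
// the root already knows how many leaves sit beneath it.
func tableRowCount(root *node) int {
	return root.treeCount
}
```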
However, we don't apply similar optimizations for related queries:
### Counting with an index prefix
```sql
DESCRIBE PLAN SELECT COUNT(*) FROM table_name WHERE pk1 = 1;
```
We end up iterating over every matching row in order to count them, but this is excessive: if an intermediate node has all of its children within the range, we should be able to just add that node's TreeCount instead.
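A hedged sketch of what the range-counting walk could look like, reusing the hypothetical node type above (the range bounds would be derived from the `pk1 = 1` filter; subtreeFirstKey/subtreeLastKey are stand-ins for whatever key bounds a real node exposes):

```go
import "bytes"

// countInRange counts keys in [lo, hi]. A subtree that lies entirely inside
// the range contributes its cached treeCount; only subtrees that straddle a
// range boundary are descended into.
func countInRange(n *node, lo, hi []byte) int {
	if n.children == nil { // leaf: count matching keys directly
		count := 0
		for _, k := range n.keys {
			if bytes.Compare(k, lo) >= 0 && bytes.Compare(k, hi) <= 0 {
				count++
			}
		}
		return count
	}
	count := 0
	for _, child := range n.children {
		first, last := subtreeFirstKey(child), subtreeLastKey(child)
		switch {
		case bytes.Compare(last, lo) < 0 || bytes.Compare(first, hi) > 0:
			// subtree entirely outside the range: skip it
		case bytes.Compare(first, lo) >= 0 && bytes.Compare(last, hi) <= 0:
			// subtree entirely inside the range: take the cached count
			count += child.treeCount
		default:
			// subtree straddles a boundary: recurse
			count += countInRange(child, lo, hi)
		}
	}
	return count
}

// Walk to the leftmost / rightmost leaf to find a subtree's key bounds.
// A real implementation would likely store these bounds on the node itself.
func subtreeFirstKey(n *node) []byte {
	for n.children != nil {
		n = n.children[0]
	}
	return n.keys[0]
}

func subtreeLastKey(n *node) []byte {
	for n.children != nil {
		n = n.children[len(n.children)-1]
	}
	return n.keys[len(n.keys)-1]
}
```

With this shape, the walk only touches the two root-to-leaf paths at the range boundaries plus the fully covered subtrees in between.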
### Counting groups
```sql
DESCRIBE PLAN SELECT pk1 FROM filenames GROUP BY (pk1);
```
It's less obvious from looking at the plan that we're still iterating over every row in the table, but we are. This is also excessive: we only need the distinct values of pk1, but if an intermediate node only has children with a single value, we should be able to skip walking that node.
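A similar hedged sketch for the grouping case, built on the hypothetical node type and key-bound helpers above (prefix is a stand-in for extracting the pk1 portion of a stored key; the real key encoding isn't shown): if a subtree's first and last keys agree on pk1, every row under it belongs to one group, so the subtree never needs to be walked.

```go
// distinctPrefixes collects the distinct pk1 values into seen, skipping any
// subtree whose first and last keys already share the same pk1.
func distinctPrefixes(n *node, prefix func(k []byte) string, seen map[string]struct{}) {
	first, last := subtreeFirstKey(n), subtreeLastKey(n)
	if prefix(first) == prefix(last) {
		seen[prefix(first)] = struct{}{} // one group covers this entire subtree
		return
	}
	if n.children == nil { // a single leaf spanning multiple groups
		for _, k := range n.keys {
			seen[prefix(k)] = struct{}{}
		}
		return
	}
	for _, child := range n.children {
		distinctPrefixes(child, prefix, seen)
	}
}
```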
### The Fix?
In both cases, evaluating the query is unnecessarily slow if there are a large number of rows where the value being grouped / filtered on is the same. This is because we're iterating over entire sections of the table, even though we don't actually care about any columns that aren't in the key prefix. Those values are only usable in aggregate functions, but there are no aggregate functions in the query.
In general, if a query has an explicit or implicit GROUP BY, but there are no aggregate functions, then it should be possible to compute the result by doing an intelligent walk on the table's tree structure, without needing to iterate over the individual table rows.
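As a rough, made-up illustration of the potential win: a table with 1,000,000 rows but only 10 distinct pk1 values, stored in a tree with fanout 100 (about log_100(1,000,000) = 3 levels), forces today's plan to visit all 1,000,000 rows, while a pruned walk only descends where a subtree spans more than one pk1 value, visiting on the order of tens of nodes.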