
feat(table): implement list_unique and Set aggregation #3710

Open · wants to merge 20 commits into main from nishant-set-agg
Conversation

f4t4nt
Contributor

@f4t4nt f4t4nt commented Jan 17, 2025

Implements the agg_set aggregation expression requested in #3661. This aggregation is similar to agg_list but keeps only the distinct elements in the resulting list.

Key features:

  1. agg_set(): Creates a list of distinct elements in a group

    • Similar interface to existing agg_list()
    • Automatically deduplicates non-null elements
    • Preserves null lists (when entire list is null)
  2. Implementation details:

    • Built on top of agg_list() and new list_unique() operation
    • Handles both regular and fixed-size lists
    • Maintains order of first occurrence for each unique element

Example usage:

# Similar to agg_list() but with distinct elements
df.groupby("key").agg([
    df["values"].agg_set()  # Creates list of unique elements per group
])

# Before: agg_list()
df.agg([col("values").agg_list()])
# {"values": [[1, None, 2, 2, 1]]}

# After: agg_set()
df.agg([col("values").agg_set()])
# {"values": [[1, 2]]}  # Nulls within lists are excluded
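The semantics described above (keep the first occurrence of each non-null element) can be sketched in plain Python; `dedup_preserve_order` is a hypothetical helper name, not part of the Daft API:

```python
def dedup_preserve_order(values):
    """Drop nulls (None) and keep only the first occurrence of each element."""
    seen = set()
    out = []
    for v in values:
        if v is None or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out

print(dedup_preserve_order([1, None, 2, 2, 1]))  # [1, 2]
```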

@github-actions github-actions bot added the feat label Jan 17, 2025
@f4t4nt f4t4nt linked an issue Jan 17, 2025 that may be closed by this pull request

codspeed-hq bot commented Jan 17, 2025

CodSpeed Performance Report

Merging #3710 will not alter performance

Comparing nishant-set-agg (aa31d66) with main (63ffd5e)

Summary

✅ 27 untouched benchmarks


codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 89.16155% with 53 lines in your changes missing coverage. Please review.

Project coverage is 77.79%. Comparing base (a8d63dd) to head (aa31d66).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-core/src/array/ops/set_agg.rs 93.04% 16 Missing ⚠️
src/daft-functions/src/list/unique.rs 85.43% 15 Missing ⚠️
daft/dataframe/dataframe.py 46.15% 7 Missing ⚠️
src/daft-core/src/series/ops/hash.rs 55.55% 4 Missing ⚠️
src/daft-core/src/python/series.rs 0.00% 3 Missing ⚠️
src/daft-logical-plan/src/ops/project.rs 0.00% 3 Missing ⚠️
daft/expressions/expressions.py 88.88% 1 Missing ⚠️
...rc/daft-core/src/series/array_impl/nested_array.rs 90.00% 1 Missing ⚠️
src/daft-core/src/series/mod.rs 97.87% 1 Missing ⚠️
src/daft-dsl/src/expr/mod.rs 88.88% 1 Missing ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3710      +/-   ##
==========================================
+ Coverage   77.58%   77.79%   +0.20%     
==========================================
  Files         729      735       +6     
  Lines       92035    93060    +1025     
==========================================
+ Hits        71409    72398     +989     
- Misses      20626    20662      +36     
Files with missing lines Coverage Δ
src/daft-core/src/series/array_impl/data_array.rs 96.33% <100.00%> (+0.37%) ⬆️
...c/daft-core/src/series/array_impl/logical_array.rs 95.16% <100.00%> (+0.71%) ⬆️
src/daft-core/src/series/ops/agg.rs 78.02% <100.00%> (+0.36%) ⬆️
src/daft-dsl/src/python.rs 91.23% <100.00%> (+0.16%) ⬆️
src/daft-functions/src/python/list.rs 100.00% <100.00%> (ø)
src/daft-functions/src/python/mod.rs 100.00% <100.00%> (ø)
...ft-physical-plan/src/physical_planner/translate.rs 94.18% <100.00%> (+0.16%) ⬆️
src/daft-table/src/lib.rs 84.27% <100.00%> (+0.12%) ⬆️
daft/expressions/expressions.py 93.96% <88.88%> (+0.32%) ⬆️
...rc/daft-core/src/series/array_impl/nested_array.rs 70.09% <90.00%> (+5.14%) ⬆️
... and 9 more

... and 69 files with indirect coverage changes

@f4t4nt f4t4nt requested a review from kevinzwang January 29, 2025 22:27
Member

@kevinzwang kevinzwang left a comment


Looks good so far. Two general notes in addition to my comments:

  1. Should list_unique be called list_distinct instead? That aligns better with the naming of the DataFrame.distinct function. The naming in other engines varies, but PySpark and DuckDB both call it distinct.
  2. Does it make sense to have ignore_nulls at all? The behavior of ignore_nulls=False is ambiguous; I'm not sure it is sensible to keep a single null value when there are any nulls. It might be better to say that these functions simply discard all nulls and let the user deal with nulls themselves. Apologies for bringing this up after you've done all this work already.
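The ambiguity the reviewer raises can be made concrete. A hypothetical sketch (not the actual Daft implementation) of one plausible reading, where ignore_nulls=False keeps a single null:

```python
def agg_set_sketch(values, ignore_nulls=True):
    """Hypothetical sketch of agg_set null policies.

    ignore_nulls=True  -> nulls are dropped entirely.
    ignore_nulls=False -> at most one null survives, appended at the end.
    """
    seen = set()
    out = []
    has_null = False
    for v in values:
        if v is None:
            has_null = True
            continue
        if v not in seen:
            seen.add(v)
            out.append(v)
    if not ignore_nulls and has_null:
        out.append(None)
    return out

print(agg_set_sketch([1, None, 2, 2, 1], ignore_nulls=True))   # [1, 2]
print(agg_set_sketch([1, None, 2, 2, 1], ignore_nulls=False))  # [1, 2, None]
```

Whether the surviving null should sit at its first-occurrence position instead is exactly the kind of underspecified detail that motivates dropping the flag.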

@@ -982,6 +983,11 @@ def agg_list(self) -> Expression:
expr = self._expr.agg_list()
return Expression._from_pyexpr(expr)

def agg_set(self, ignore_nulls: bool = True) -> Expression:
"""Aggregates the values in the expression into a set."""
Add some examples here, especially with behavior on nulls

(Showing first 3 of 3 rows)

Args:
ignore_nulls: Whether to ignore null values in the result. Defaults to True.
Maybe add some examples here too. Does ignore_nulls=True mean nulls are considered the same or unique? I'm okay with either but good to clarify in the docs

docs/sphinx/source/expressions.rst (comment resolved)
src/daft-core/src/array/ops/set_agg.rs (outdated, comment resolved)
Comment on lines +53 to +56
if ignore_nulls {
let not_null_mask = DaftNotNull::not_null(self)?.into_series();
child_series = child_series.filter(not_null_mask.bool()?)?;
}
An alternative to this is to let deduplicate_series handle the null filtering. It's going through all of the elements anyway.

}
}

fn evaluate(&self, inputs: &[Series]) -> DaftResult<Series> {
I would move the implementation of evaluate (the stuff inside the match) out of this file and into maybe daft-core/src/array/ops to match the other functions. This file is kind of like the definition (naming, typing, etc) of the function, instead of the actual implementation.

Comment on lines +60 to +71
let fixed_list = input.fixed_size_list()?;
let arrow_array = fixed_list.to_arrow();
let list_type = ArrowDataType::List(Box::new(arrow2::datatypes::Field::new(
"item",
inner_type.as_ref().to_arrow()?,
true,
)));
let list_array = cast(arrow_array.as_ref(), &list_type, Default::default())?;
Series::try_from_field_and_arrow_array(
Field::new(input.name(), DataType::List(inner_type.clone())),
list_array,
)?
I believe you can straight up cast a FixedSizeList type series into a List type instead of implementing this
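Conceptually, casting a FixedSizeList into a List just regroups the flat child buffer by a constant width, which is what the manual arrow2 conversion above does by hand. A plain-Python sketch of that regrouping (hypothetical helper name):

```python
def fixed_size_list_to_list(flat_values, size):
    """Regroup a flat child buffer into rows of a constant width.

    This mirrors what a FixedSizeList -> List cast amounts to: the values
    are unchanged; only the offsets become explicit.
    """
    assert len(flat_values) % size == 0, "buffer must be a multiple of the width"
    return [flat_values[i:i + size] for i in range(0, len(flat_values), size)]

print(fixed_size_list_to_list([1, 2, 3, 4, 5, 6], 3))  # [[1, 2, 3], [4, 5, 6]]
```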

Comment on lines +133 to +135
for (i, series) in result.iter().enumerate() {
growable.extend(i, 0, series.len());
}
The way you're using the growable here still requires two copies of the data, once to construct result and once at growable.build(). Happy to chat about how you could do this more efficiently!
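The double-copy point can be illustrated in plain Python (hypothetical names; the real code operates on Series and a Growable):

```python
def concat_two_copies(groups):
    # mirrors the reviewed pattern: materialize each group ("result"),
    # then copy everything again when building the final output
    result = [list(g) for g in groups]   # first copy
    flat = []
    for g in result:
        flat.extend(g)                   # second copy
    return flat

def concat_one_copy(groups):
    # alternative: size the output up front and write each element exactly once
    total = sum(len(g) for g in groups)
    out = [None] * total
    i = 0
    for g in groups:
        for v in g:
            out[i] = v
            i += 1
    return out
```

Both produce the same result; the second avoids materializing the intermediate collection.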

Comment on lines +1102 to +1111
let list_agg_id =
add_to_stage(AggExpr::List, e.clone(), schema, &mut first_stage_aggs);
let list_concat_id = add_to_stage(
AggExpr::Concat,
col(list_agg_id.clone()),
schema,
&mut second_stage_aggs,
);
let result = unique(col(list_concat_id.clone()), *ignore_nulls).alias(output_name);
final_exprs.push(result);
Come to think of it, since the translation here converts it into a two-stage aggregation that uses list and then concat+unique, are we ever actually using Series::agg_set during an execution of agg_set? If we don't intend to, there's actually no need to implement that at all.
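The two-stage lowering described here can be sketched over in-memory partitions, assuming the semantics shown elsewhere in this PR (first-occurrence order, nulls dropped); this is an illustration, not the planner's actual code:

```python
def two_stage_agg_set(partitions):
    # stage 1: each partition aggregates its values into a list (AggExpr::List)
    stage1 = [list(p) for p in partitions]
    # stage 2: concatenate the per-partition lists (AggExpr::Concat)...
    concatenated = [v for part in stage1 for v in part]
    # ...then apply unique as a final projection over the concatenated list
    seen, out = set(), []
    for v in concatenated:
        if v is None or v in seen:
            continue
        seen.add(v)
        out.append(v)
    return out

print(two_stage_agg_set([[1, None, 2], [2, 1, 3]]))  # [1, 2, 3]
```

Since unique runs only after the concat stage, the single-stage Series::agg_set path is indeed never exercised by this plan.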

It is pretty confusing how the expression that is called is not always the one that we actually execute in an agg. Let me know if you need me to explain any of this.

Comment on lines +526 to +530
// AggExpr::Set(expr) => {
// let list = self.eval_expression(expr)?.agg_list(groups)?;
// let unique_expr = unique(col(list.name()), false);
// self.eval_expression(&unique_expr)
// }
remove

Successfully merging this pull request may close these issues.

agg_set aggregation expression