[FEAT] Add wildcards in column expressions (#2629)

Allows using wildcards in columns to match multiple columns, as well as subfields in structs: `col("*")`, `col("mystruct.*")`

Helps address #1965, but does not support partial matches of names. I believe this should be done separately using a different expression like `regexcol()` to avoid column naming conflicts.

Examples:

```py
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df.select("*").show()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
>>> df.select(col("*").sqrt()).show()
╭────────────────────┬───────────────────╮
│ a                  ┆ b                 │
│ ---                ┆ ---               │
│ Float64            ┆ Float64           │
╞════════════════════╪═══════════════════╡
│ 1                  ┆ 2                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.4142135623730951 ┆ 2.23606797749979  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.7320508075688772 ┆ 2.449489742783178 │
╰────────────────────┴───────────────────╯
```

```py
>>> df = daft.from_pydict({"a": [{"b": 1, "c": 4}, {"b": 2, "c": 5}]})
>>> df.select("a.*").show()
╭───────┬───────╮
│ b     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
╰───────┴───────╯
```
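
The documentation added in this commit describes wildcard expansion as finding the matching columns, copying the expression once per match, and replacing the wildcard. The following is a minimal pure-Python sketch of that idea, not Daft's actual implementation. The tuple-based expression tree and the helpers `count_wildcards`, `substitute`, `output_name`, and `expand` are hypothetical names introduced only for illustration, and raising on duplicate names inside `expand` is an assumption about where that error surfaces.

```python
# Toy expression tree (an assumption for illustration, not Daft's types):
#   ("col", name) | ("lit", value) | (op, left, right)


def count_wildcards(expr):
    """Count ("col", "*") nodes in the tree."""
    if expr[0] == "col":
        return 1 if expr[1] == "*" else 0
    if expr[0] == "lit":
        return 0
    return count_wildcards(expr[1]) + count_wildcards(expr[2])


def substitute(expr, name):
    """Return a copy of the tree with the wildcard replaced by `name`."""
    if expr[0] == "col":
        return ("col", name) if expr[1] == "*" else expr
    if expr[0] == "lit":
        return expr
    return (expr[0], substitute(expr[1], name), substitute(expr[2], name))


def output_name(expr):
    """Model the documented rule that the left-hand side's name is inherited."""
    if expr[0] == "col":
        return expr[1]
    if expr[0] == "lit":
        return "literal"  # placeholder name, chosen for this sketch
    return output_name(expr[1])


def expand(expr, columns):
    """Expand one wildcard expression into one expression per matching column."""
    n = count_wildcards(expr)
    if n > 1:
        raise ValueError("only one wildcard is allowed per expression tree")
    if n == 0:
        return [expr]
    expanded = [substitute(expr, name) for name in columns]
    names = [output_name(e) for e in expanded]
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate output column names: {names}")
    return expanded


# col("*") * 3 over columns A and B becomes col("A") * 3 and col("B") * 3.
for e in expand(("mul", ("col", "*"), ("lit", 3)), ["A", "B"]):
    print(output_name(e), e)
```

In this model, `("add", ("col", "*"), ("col", "A"))` expands cleanly because each copy takes its name from the substituted column, while `("add", ("col", "A"), ("col", "*"))` fails because every copy is named `A`. That matches the caveats listed in the documentation changes of this commit.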
Eventual-Inc · Aug 9, 2024 · 9562cc6
1 parent e288ed5
commit 9562cc6
Showing 13 changed files with 952 additions and 290 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/daft/dataframe/dataframe.py b/daft/dataframe/dataframe.py
@@ -982,6 +982,10 @@ def __getitem__(self, item: Union[slice, int, str, Iterable[Union[str, int]]]) -
             return result
         elif isinstance(item, str):
             schema = self._builder.schema()
+            if (item == "*" or item.endswith(".*")) and item not in schema.column_names():
+                # Known limitation: unusual column names are not handled. For example, if
+                # struct "a" has a field literally named "*", then "a.*" is wrongly rejected.
+                raise ValueError("Wildcard expressions are not supported in DataFrame.__getitem__")
             expr, _ = resolve_expr(col(item)._expr, schema._schema)
             return Expression._from_pyexpr(expr)
         elif isinstance(item, Iterable):

diff --git a/daft/expressions/expressions.py b/daft/expressions/expressions.py
@@ -124,7 +124,9 @@ def lit(value: object) -> Expression:
 
 
 def col(name: str) -> Expression:
-    """Creates an Expression referring to the column with the provided name
+    """Creates an Expression referring to the column with the provided name.
+
+    See :ref:`Column Wildcards` for details on wildcards.
 
     Example:
         >>> import daft

diff --git a/docs/source/user_guide/basic_concepts/expressions.rst b/docs/source/user_guide/basic_concepts/expressions.rst
@@ -42,6 +42,33 @@ You may also find it necessary in certain situations to create an Expression wit
 
 When this Expression is evaluated, it will resolve to "the column named A" in whatever evaluation context it is used within!
 
+Refer to multiple columns using a wildcard
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can create expressions on multiple columns at once using a wildcard. The expression `col("*")` selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:
+
+.. code:: python
+
+    import daft
+    from daft import col
+
+    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
+    df.select(col("*") * 3).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ A     ┆ B     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 3     ┆ 12    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 6     ┆ 15    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 9     ┆ 18    │
+    ╰───────┴───────╯
+
 Literals
 ^^^^^^^^
 

diff --git a/docs/source/user_guide/daft_in_depth/dataframe-operations.rst b/docs/source/user_guide/daft_in_depth/dataframe-operations.rst
@@ -94,6 +94,64 @@ As we have already seen in previous guides, adding a new column can be achieved
     +---------+---------+---------+
     (Showing first 3 rows)
 
+.. _Column Wildcards:
+
+Selecting Columns Using Wildcards
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We can select multiple columns at once using wildcards. The expression `col("*")` selects every column in a DataFrame, and we can operate on this expression in the same way as on a single column:
+
+.. code:: python
+
+    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
+    df.select(col("*") * 3).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ A     ┆ B     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 3     ┆ 12    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 6     ┆ 15    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 9     ┆ 18    │
+    ╰───────┴───────╯
+
+We can also select multiple columns within structs using `col("struct.*")`:
+
+.. code:: python
+
+    df = daft.from_pydict({
+        "A": [
+            {"B": 1, "C": 2},
+            {"B": 3, "C": 4}
+        ]
+    })
+    df.select(col("A.*")).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ B     ┆ C     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 1     ┆ 2     │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 3     ┆ 4     │
+    ╰───────┴───────╯
+
+Under the hood, wildcards work by finding all of the columns that match, then copying the expression several times and replacing the wildcard. This means that there are some caveats:
+
+* Only one wildcard is allowed per expression tree. This means that `col("*") + col("*")` and similar expressions do not work.
+* Be mindful of duplicate column names. Code like `df.select(col("*"), col("*") + 3)` will not work because both wildcards expand into the same column names.
+
+  For the same reason, `col("A") + col("*")` will not work: an expression inherits the name of its left-hand side, so every output column is named `A`, which causes an error if the wildcard matches more than one column.
+  However, `col("*") + col("A")` will work fine.
+
 Selecting Rows
 --------------
 

diff --git a/src/daft-dsl/Cargo.toml b/src/daft-dsl/Cargo.toml
@@ -6,6 +6,7 @@ common-treenode = {path = "../common/treenode", default-features = false}
 daft-core = {path = "../daft-core", default-features = false}
 daft-sketch = {path = "../daft-sketch", default-features = false}
 itertools = {workspace = true}
+log = {workspace = true}
 pyo3 = {workspace = true, optional = true}
 serde = {workspace = true}
 serde_json = {workspace = true}