[FEAT] Add wildcards in column expressions (#2629)
Allows using wildcards in columns to match multiple columns, as well as
subfields in structs: `col("*")`, `col("mystruct.*")`

Helps address #1965, but does not support partial matches of names. I
believe this should be done separately using a different expression like
`regexcol()` to avoid column naming conflicts.

Examples:
```py
>>> df = daft.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df.select("*").show()
╭───────┬───────╮
│ a     ┆ b     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3     ┆ 6     │
╰───────┴───────╯
>>> df.select(col("*").sqrt()).show()
╭────────────────────┬───────────────────╮ 
│ a                  ┆ b                 │
│ ---                ┆ ---               │
│ Float64            ┆ Float64           │
╞════════════════════╪═══════════════════╡
│ 1                  ┆ 2                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.4142135623730951 ┆ 2.23606797749979  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.7320508075688772 ┆ 2.449489742783178 │
╰────────────────────┴───────────────────╯
```

```py
>>> df = daft.from_pydict({"a": [{"b": 1, "c": 4}, {"b": 2, "c": 5}]})
>>> df.select("a.*").show()
╭───────┬───────╮
│ b     ┆ c     │
│ ---   ┆ ---   │
│ Int64 ┆ Int64 │
╞═══════╪═══════╡
│ 1     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2     ┆ 5     │
╰───────┴───────╯
```
Vince7778 authored Aug 9, 2024
1 parent e288ed5 commit 9562cc6
Showing 13 changed files with 952 additions and 290 deletions.
1 change: 1 addition & 0 deletions Cargo.lock

Generated file; diff not rendered by default.

4 changes: 4 additions & 0 deletions daft/dataframe/dataframe.py
@@ -982,6 +982,10 @@ def __getitem__(self, item: Union[slice, int, str, Iterable[Union[str, int]]]) -
             return result
         elif isinstance(item, str):
             schema = self._builder.schema()
+            if (item == "*" or item.endswith(".*")) and item not in schema.column_names():
+                # does not account for weird column names
+                # like if struct "a" has a field named "*", then a.* will wrongly fail
+                raise ValueError("Wildcard expressions are not supported in DataFrame.__getitem__")
             expr, _ = resolve_expr(col(item)._expr, schema._schema)
             return Expression._from_pyexpr(expr)
         elif isinstance(item, Iterable):
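The guard added to `__getitem__` can be read as a small predicate. The sketch below restates that condition in plain Python so its edge cases are easier to see. `is_rejected_wildcard` is a hypothetical helper name used only for illustration; it is not part of Daft, and the column names are passed in as a plain list rather than read from a schema.

```python
from typing import Sequence


def is_rejected_wildcard(item: str, column_names: Sequence[str]) -> bool:
    """Restate the condition from the diff (hypothetical helper, not Daft API).

    An item is rejected when it looks like a wildcard ("*" or "<name>.*")
    and is not itself the exact name of a top-level column.
    """
    looks_like_wildcard = item == "*" or item.endswith(".*")
    return looks_like_wildcard and item not in column_names


# A plain wildcard is rejected when no column is literally named "*".
print(is_rejected_wildcard("*", ["a", "b"]))
# A column that really is named "*" is still addressable.
print(is_rejected_wildcard("*", ["*", "b"]))
# "a.*" is rejected whenever "a.*" is not a top-level column name,
# regardless of what fields the struct "a" contains.
print(is_rejected_wildcard("a.*", ["a"]))
```

Because the check only consults top-level column names, `a.*` is rejected even if struct `a` happens to have a field named `*`, which is the limitation the comment in the diff points out.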
4 changes: 3 additions & 1 deletion daft/expressions/expressions.py
@@ -124,7 +124,9 @@ def lit(value: object) -> Expression:


 def col(name: str) -> Expression:
-    """Creates an Expression referring to the column with the provided name
+    """Creates an Expression referring to the column with the provided name.
+
+    See :ref:`Column Wildcards` for details on wildcards.

     Example:
         >>> import daft
27 changes: 27 additions & 0 deletions docs/source/user_guide/basic_concepts/expressions.rst
@@ -42,6 +42,33 @@ You may also find it necessary in certain situations to create an Expression wit

 When this Expression is evaluated, it will resolve to "the column named A" in whatever evaluation context it is used within!

+Refer to multiple columns using a wildcard
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can create expressions on multiple columns at once using a wildcard. The expression `col("*")` selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:
+
+.. code:: python
+
+    import daft
+    from daft import col
+
+    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
+    df.select(col("*") * 3).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ A     ┆ B     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 3     ┆ 12    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 6     ┆ 15    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 9     ┆ 18    │
+    ╰───────┴───────╯
+
 Literals
 ^^^^^^^^

58 changes: 58 additions & 0 deletions docs/source/user_guide/daft_in_depth/dataframe-operations.rst
@@ -94,6 +94,64 @@ As we have already seen in previous guides, adding a new column can be achieved
     +---------+---------+---------+
     (Showing first 3 rows)

+.. _Column Wildcards:
+
+Selecting Columns Using Wildcards
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We can select multiple columns at once using wildcards. The expression `col("*")` selects every column in a DataFrame, and you can operate on this expression in the same way as a single column:
+
+.. code:: python
+
+    df = daft.from_pydict({"A": [1, 2, 3], "B": [4, 5, 6]})
+    df.select(col("*") * 3).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ A     ┆ B     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 3     ┆ 12    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 6     ┆ 15    │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 9     ┆ 18    │
+    ╰───────┴───────╯
+
+We can also select multiple columns within structs using `col("struct.*")`:
+
+.. code:: python
+
+    df = daft.from_pydict({
+        "A": [
+            {"B": 1, "C": 2},
+            {"B": 3, "C": 4}
+        ]
+    })
+    df.select(col("A.*")).show()
+
+.. code:: none
+
+    ╭───────┬───────╮
+    │ B     ┆ C     │
+    │ ---   ┆ ---   │
+    │ Int64 ┆ Int64 │
+    ╞═══════╪═══════╡
+    │ 1     ┆ 2     │
+    ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
+    │ 3     ┆ 4     │
+    ╰───────┴───────╯
+
+Under the hood, wildcards work by finding all of the columns that match, then copying the expression several times and replacing the wildcard. This means that there are some caveats:
+
+* Only one wildcard is allowed per expression tree. This means that `col("*") + col("*")` and similar expressions do not work.
+* Be conscious about duplicated column names. Any code like `df.select(col("*"), col("*") + 3)` will not work because the wildcards expand into the same column names.
+
+For the same reason, `col("A") + col("*")` will not work because the name on the left-hand side is inherited, meaning all the output columns are named `A`, causing an error if there is more than one.
+However, `col("*") + col("A")` will work fine.
+
 Selecting Rows
 --------------

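The caveats in the wildcard documentation above follow from the expand-and-copy behavior it describes. The following is a toy model of that behavior, not Daft's implementation: expressions are plain tuples, and every helper name (`count_wildcards`, `substitute`, `output_name`, `expand`, `select_names`) is hypothetical. It assumes, as the documentation states, that a binary expression inherits its name from its left-hand side. Naming a literal `"literal"` is an assumption of the toy model only, and struct wildcards such as `A.*` are left out.

```python
# Toy expression tree (hypothetical, for illustration only):
#   ("col", name) | ("lit", value) | ("add", left, right)


def count_wildcards(expr):
    """Count how many col("*") leaves appear in the tree."""
    kind = expr[0]
    if kind == "col":
        return 1 if expr[1] == "*" else 0
    if kind == "lit":
        return 0
    return count_wildcards(expr[1]) + count_wildcards(expr[2])


def substitute(expr, name):
    """Copy the tree, replacing the wildcard leaf with a concrete column."""
    kind = expr[0]
    if kind == "col":
        return ("col", name) if expr[1] == "*" else expr
    if kind == "lit":
        return expr
    return (kind, substitute(expr[1], name), substitute(expr[2], name))


def output_name(expr):
    """Name inherited from the left-most leaf (toy assumption for literals)."""
    kind = expr[0]
    if kind == "col":
        return expr[1]
    if kind == "lit":
        return "literal"
    return output_name(expr[1])


def expand(expr, columns):
    """Expand one expression into one copy per matching column."""
    n = count_wildcards(expr)
    if n > 1:
        raise ValueError("only one wildcard is allowed per expression tree")
    if n == 0:
        return [expr]
    return [substitute(expr, c) for c in columns]


def select_names(exprs, columns):
    """Expand every expression and reject duplicate output names."""
    expanded = [e for expr in exprs for e in expand(expr, columns)]
    names = [output_name(e) for e in expanded]
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate output column names: {names}")
    return names


# col("*") + col("A"): each copy is named after the expanded column.
print(select_names([("add", ("col", "*"), ("col", "A"))], ["A", "B"]))  # ['A', 'B']
```

In this model, `col("A") + col("*")` expands into two expressions that are both named `A`, so `select_names` raises, while `col("*") + col("A")` produces the distinct names `A` and `B`.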
1 change: 1 addition & 0 deletions src/daft-dsl/Cargo.toml
@@ -6,6 +6,7 @@ common-treenode = {path = "../common/treenode", default-features = false}
 daft-core = {path = "../daft-core", default-features = false}
 daft-sketch = {path = "../daft-sketch", default-features = false}
 itertools = {workspace = true}
+log = {workspace = true}
 pyo3 = {workspace = true, optional = true}
 serde = {workspace = true}
 serde_json = {workspace = true}