Implement groupby yielding #46

spenczar · 2023-08-21T23:38:31Z

We have repeatedly found ourselves doing something like this:

import pyarrow.compute as pc

def group_by_obscode(self):
    unique_codes = self.obscode.unique()
    for c in unique_codes:
        # One full-column comparison per unique code
        yield self.apply_mask(pc.equal(self.obscode, c))

Of course, this is rather inefficient: it builds a fresh boolean mask over the whole column for every unique value, so the work grows with the number of rows times the number of distinct values. For a dimension with high cardinality, this is unnecessarily slow.

On some real-world data (the unique exposure times of a month of NSC data), I got a 4x speedup over the above by using much lower-level pyarrow primitives.

It could look something like this:

import pyarrow as pa

def group_by_column(self, column_name: str):
    names = self.table.schema.names
    # Collect every column into one list per group in a single pass
    groups = self.table.group_by(column_name).aggregate([(name, "list") for name in names])
    for i in range(groups.num_rows):
        # Aggregated columns are named "<column>_list"; look them up by name rather than by position
        tablet = pa.Table.from_arrays([groups[f"{name}_list"][i].values for name in names], schema=self.schema)
        # Probably extra care needed here for metadata
        yield self.from_pyarrow(tablet)

This is pretty gnarly internals, but it's wicked fast.
