We repeatedly have found ourselves doing something like this:
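A minimal sketch of that pattern (assuming a wrapper class that holds a pyarrow Table as self.table and has a from_pyarrow constructor, matching the snippet further down):

import pyarrow.compute as pc

def group_by_column(self, column_name: str):
    # Builds a fresh full-length boolean mask for every unique value.
    for value in pc.unique(self.table[column_name]):
        mask = pc.equal(self.table[column_name], value)
        yield self.from_pyarrow(self.table.filter(mask))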
Of course, this is rather inefficient, since it recomputes the mask on every iteration of the loop. For a dimension with high cardinality, this is unnecessarily slow.
In some real-world data (the unique exposure times of a month of NSC data), I got a 4x speedup over the approach above by using much lower-level pyarrow primitives.
It could look something like this:
import pyarrow as pa

def group_by_column(self, column_name: str):
    # Single pass: aggregate every column into a per-group list, keyed on column_name.
    groups = self.table.group_by(column_name).aggregate([(field.name, "list") for field in self.table.schema])
    # The aggregated "<name>_list" columns come first, in schema order; pyarrow appends the key column last.
    for i in range(len(groups)):
        arrays = [groups[j][i].values for j in range(len(self.table.schema))]
        tablet = pa.Table.from_arrays(arrays, schema=self.schema)
        # Probably extra care needed here for metadata
        yield self.from_pyarrow(tablet)
This is pretty gnarly internals, but it's wicked fast.
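As a hypothetical usage sketch (the table and column names are illustrative, not from the data above):

# Split a month of observations into one table per unique exposure time.
for group in observations.group_by_column("exposure_time"):
    print(group)  # each group holds the rows for a single exposure time

The single hash-based group_by pass replaces one full-table scan per unique value, and each ListScalar's .values is a view into the aggregated child array, which is presumably where the speedup comes from.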