We are using Dask to parallelize the Lark tree generation, but the multi-threaded client with `map_partitions` doesn't seem to speed anything up (the parsing is CPU-bound, so threads are likely blocked by the GIL). Consider using multiple processes (naively switching schedulers causes weird bugs), or using `@delayed` function calls.
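One process-based pattern worth trying is a pool whose initializer builds the parser once per worker, so the expensive grammar-to-parser step is amortized over all the rows that worker handles. This is a minimal sketch using the stdlib `concurrent.futures` rather than a Dask cluster, with a hypothetical `StubParser` standing in for `lark.Lark`:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

class StubParser:
    """Hypothetical stand-in for lark.Lark: constructing it from
    grammar text is the expensive step we want to pay once per process."""
    def __init__(self, grammar_text):
        self.grammar_text = grammar_text

    def parse(self, text):
        # Real code would return a parse tree; a tuple keeps the sketch small.
        return (self.grammar_text, text)

_PARSER = None  # one parser per worker process

def _init_worker(grammar_text):
    # Runs once in each worker process, so text -> parser is paid
    # once per worker instead of once per row.
    global _PARSER
    _PARSER = StubParser(grammar_text)

def _parse_row(text):
    return _PARSER.parse(text)

def parse_all(grammar_text, rows, workers=2):
    # "fork" keeps the sketch simple on POSIX; the "spawn" start method
    # (default on macOS/Windows) needs the usual __main__ guard around
    # the caller.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=workers,
                             mp_context=ctx,
                             initializer=_init_worker,
                             initargs=(grammar_text,)) as pool:
        return list(pool.map(_parse_row, rows))
```

`pool.map` preserves input order, so results line up with the original rows. The same initializer idea carries over to Dask's process-based schedulers, where per-worker state can be set up once before partitions are dispatched.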
Using PySpark didn't help either. Another approach is to first generate grammar files for each field in /tmp, so that each process can spawn its own Lark objects from them. Since text -> Lark construction is not cheap, we can consider:

- Partitioning the data frame by field, so each process only needs to build Lark objects for the fields it actually parses.
- Re-using Lark objects within each process. This requires modifying `text_to_tree()` so that it takes a new argument, `parser`.
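The re-use idea above can be sketched with a per-process cache keyed by field, so each worker builds at most one parser per field it sees. `text_to_tree` is the function named in this issue; `parser_for_field`, `GRAMMARS`, and `StubParser` (a stand-in for `lark.Lark`) are hypothetical illustration names:

```python
from functools import lru_cache

# Hypothetical per-field grammars; in the proposal these would live as
# generated files under /tmp, one per field.
GRAMMARS = {
    "timestamp": "start: INT",
    "message": "start: WORD+",
}

class StubParser:
    """Stand-in for lark.Lark; constructing it is the costly step."""
    def __init__(self, grammar_text):
        self.grammar_text = grammar_text

    def parse(self, text):
        return (self.grammar_text, text)

@lru_cache(maxsize=None)
def parser_for_field(field):
    # Cached per process: a second call for the same field re-uses the
    # already-built parser instead of recompiling the grammar text.
    return StubParser(GRAMMARS[field])

def text_to_tree(text, parser):
    # The modified signature: the caller supplies the parser instead of
    # text_to_tree building a new one on every call.
    return parser.parse(text)
```

If the data frame is also partitioned by field, each process touches only a few distinct fields, so the cache stays small and nearly every row hits an already-built parser.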