Improve BRENDA load time #1

Open
y1zhou opened this issue Nov 9, 2021 · 1 comment
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@y1zhou
Owner

y1zhou commented Nov 9, 2021

I tried using Dask to parallelize the Lark tree generation, but running map_partitions on the multi-threaded client doesn't seem to speed anything up. Options to consider: use multiple processes (simply switching causes weird bugs), or use @delayed function calls.
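If the parsing is CPU-bound Python code, threads would likely contend for the GIL, which may be why the threaded client shows no speedup. Below is a minimal sketch of the multi-process route, using the standard library's concurrent.futures in place of Dask. `parse_text`, `chunked`, `parse_chunk` and `parse_parallel` are hypothetical names, and `parse_text` is only a stand-in for the real text -> Lark tree step.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_text(text: str) -> list[str]:
    """Hypothetical stand-in for the real text -> Lark tree step."""
    return text.split()


def parse_chunk(chunk: Sequence[str]) -> list[list[str]]:
    """Parse one chunk of texts; this runs inside a worker process."""
    return [parse_text(text) for text in chunk]


def parse_parallel(
    texts: Sequence[str], n_jobs: int = 2, chunk_size: int = 1000
) -> list[list[str]]:
    """Parse `texts` across `n_jobs` processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        # Sending whole chunks keeps the per-task pickling overhead low.
        results = pool.map(parse_chunk, chunked(texts, chunk_size))
        return [tree for chunk in results for tree in chunk]
```

`parse_parallel(texts, n_jobs=4)` should be called from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module. The worker function lives at module level so it can be pickled.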

@y1zhou
Owner Author

y1zhou commented Dec 3, 2021

Using PySpark didn't help either. Another approach is to first generate grammar files for each field in /tmp, so that the function can then spawn Lark objects in each process. Since text -> Lark is not cheap, we can consider:

  1. Try to partition the data frame based on the fields. This way each process only needs to generate Lark objects for fields it needs to parse.
  2. Re-use Lark objects in each process. This requires modifying text_to_tree() so that it takes a new argument, parser.
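The two ideas above can be sketched together as follows. This is only an illustration under stated assumptions: `StubParser` is a hypothetical stand-in for a Lark object, and the grammar directory name, the `.lark` file extension, and the helpers `get_parser`, `group_by_field` and `parse_group` are made up for the example. Only `text_to_tree()` and its new `parser` argument come from the discussion.

```python
import tempfile
from collections import defaultdict
from functools import lru_cache
from pathlib import Path

# Assumed location of the per-field grammar files (a folder under the temp dir).
GRAMMAR_DIR = Path(tempfile.gettempdir()) / "brenda_grammars"


class StubParser:
    """Hypothetical stand-in for a Lark object built from a grammar string."""

    def __init__(self, grammar: str) -> None:
        self.grammar = grammar

    def parse(self, text: str) -> tuple[str, str]:
        return (self.grammar, text)


@lru_cache(maxsize=None)
def get_parser(field: str) -> StubParser:
    """Build the parser for `field` once per process, then reuse it."""
    grammar = (GRAMMAR_DIR / f"{field}.lark").read_text()
    # The real code would construct the Lark object from `grammar` here.
    return StubParser(grammar)


def text_to_tree(text: str, parser: StubParser) -> tuple[str, str]:
    """Parse `text` with an already-built parser (the new `parser` argument)."""
    return parser.parse(text)


def group_by_field(rows) -> dict[str, list[str]]:
    """Group (field, text) rows so each worker only touches the fields it needs."""
    groups = defaultdict(list)
    for field, text in rows:
        groups[field].append(text)
    return dict(groups)


def parse_group(field: str, texts: list[str]) -> list[tuple[str, str]]:
    """Parse every text of one field with a single shared parser."""
    parser = get_parser(field)
    return [text_to_tree(text, parser) for text in texts]
```

Because the cache is per process, each worker pays the text -> parser cost at most once per field, and dispatching one `parse_group` call per field keeps the number of parsers built in each worker small.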

# Setup Spark
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master(f"local[{n_jobs}]")
    .appName("Brenda")
    .config("spark.driver.memory", f"{n_jobs}g")
    .getOrCreate()
)

# Parse the text in parallel
sdf = spark.createDataFrame(df)
df["description"] = sdf.rdd.map(
    lambda row: text_to_tree(row.description, parsers[row.field])
).collect()
spark.stop()

y1zhou added the enhancement and help wanted labels Dec 19, 2021