Improve BRENDA load time #1

y1zhou · 2021-11-09T21:10:23Z

Tried using Dask to parallelize the Lark tree generation, but the multi-threaded client behind map_partitions doesn't seem to speed anything up. Consider using multiple processes (simply switching causes weird bugs) or @delayed function calls.


y1zhou · 2021-12-03T19:33:55Z

Using PySpark didn't help either. Another approach is to first write the grammar file for each field to /tmp, so that the function can spawn the Lark objects inside each process. Since going from text to a Lark object is not cheap, we can consider:

Try to partition the data frame based on the fields. This way each process only needs to generate Lark objects for fields it needs to parse.
Re-use Lark objects in each process. This requires modifying text_to_tree() such that it takes a new argument parser.
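
A minimal sketch of the two ideas above, under loud assumptions: the GRAMMARS table, get_parser(), parse_partition(), partition_by_field() and parse_rows() are hypothetical names, and re.compile stands in for the expensive Lark construction so the example runs with the standard library only. It is not the code in neo4j.py, only an illustration of partitioning by field and building each parser once per process.

```python
import re
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

# Hypothetical per-field grammars. In the real code these would be Lark
# grammars; plain regexes are used here only to keep the sketch runnable.
GRAMMARS = {
    "KM": r"#(\d+)#\s+(\S+)",
    "SN": r"(.+)",
}


@lru_cache(maxsize=None)
def get_parser(field):
    """Build the parser for a field once per process, then reuse it.

    Assumption: re.compile stands in for the costly grammar -> Lark step.
    """
    return re.compile(GRAMMARS[field])


def text_to_tree(text, parser):
    """Stand-in for the real text_to_tree(text, parser)."""
    match = parser.match(text)
    return match.groups() if match else None


def parse_partition(field, texts):
    """Parse all texts of one field with a single, cached parser."""
    parser = get_parser(field)
    return [text_to_tree(text, parser) for text in texts]


def partition_by_field(rows):
    """Group (field, text) rows by field, remembering original positions."""
    groups = defaultdict(list)
    for idx, (field, text) in enumerate(rows):
        groups[field].append((idx, text))
    return groups


def parse_rows(rows, n_jobs=2):
    """Send one partition per field to a process pool, keeping row order."""
    groups = partition_by_field(rows)
    results = [None] * len(rows)
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        futures = {
            field: pool.submit(parse_partition, field, [t for _, t in items])
            for field, items in groups.items()
        }
        for field, future in futures.items():
            for (idx, _), tree in zip(groups[field], future.result()):
                results[idx] = tree
    return results
```

With one partition per field, the parallelism is capped by the number of distinct fields, so large fields may need to be split further; each worker would still build a given parser at most once thanks to the cache.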

metabolike/metabolike/parser/neo4j.py

Lines 54 to 70 in 82da899

    
    # Setup Spark
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master(f"local[{n_jobs}]")
        .appName("Brenda")
        .config("spark.driver.memory", f"{n_jobs}g")
        .getOrCreate()
    )

    # Parse the text in parallel
    sdf = spark.createDataFrame(df)
    df["description"] = sdf.rdd.map(
        lambda row: text_to_tree(row.description, parsers[row.field])
    ).collect()
    spark.stop()

y1zhou added the enhancement and help wanted labels · Dec 19, 2021
