The current version of Drudge uses Spark (PySpark) for parallelism. Despite the computational speed-up it brings, the dependence on Spark adds an extra layer of complexity to developing, maintaining, and using Drudge. A typical problem during Drudge development is that the program crashes inside Spark code with a Scala/Java stack trace, which is hard for a Python developer to interpret. Our current workaround is to use `dummy_spark` for debugging and `pyspark` for production. However, this two-step approach turns out to be impractical for large-scale problems. Moreover, a non-Python dependency complicates deployment.
An alternative to Spark is the Dask library. Dask is implemented in pure Python and integrates well with other scientific/numeric/HPC Python libraries. It works on single workstations as well as clusters. A Dask backend would make it easier to debug and profile Drudge codes without sacrificing performance. For the implementation, Dask collections such as `dask.bag` (as a replacement for the Spark RDD) or `dask.delayed` may be a good place to start.
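To give a rough idea of why `dask.bag` is a natural RDD replacement, here is a minimal sketch of a Spark-style map/filter/aggregate pipeline expressed with it. The toy data and lambdas are purely illustrative, not actual Drudge code; the `# cf.` comments note the roughly corresponding RDD calls.

```python
import dask.bag as db

# Spark-style pipeline on a partitioned collection; like an RDD, a bag is
# evaluated lazily until .compute() is called.
terms = db.from_sequence(range(10), npartitions=2)  # cf. sc.parallelize(...)
result = (
    terms.map(lambda x: x * x)          # cf. rdd.map(...)
         .filter(lambda x: x % 2 == 0)  # cf. rdd.filter(...)
         .sum()                         # cf. rdd.sum(); still lazy here
         .compute()                     # triggers execution
)
print(result)  # 120
```

Because the bag API mirrors the RDD API this closely, much of Drudge's term-stream manipulation could plausibly be ported call-for-call, while debugging stays in pure Python (e.g. via Dask's synchronous single-threaded scheduler) instead of requiring a `dummy_spark` swap.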
Gaurav and I have talked about switching from Spark to Dask. We may start experimenting when time allows.