
Replace Spark with Dask #15

Open
chenpeizhi opened this issue Mar 27, 2020 · 0 comments
chenpeizhi commented Mar 27, 2020

The current version of Drudge uses Spark (PySpark) for parallelism. Despite the computational speed-up it brings, the dependence on Spark adds an extra layer of complexity to developing, maintaining, and using Drudge. A typical problem during Drudge development is that the program crashes inside Spark code and throws a Scala/Java error message that is hard for a Python developer to interpret. Our current workaround is to use dummy_spark for debugging and pyspark for production, but this two-step approach turns out to be impractical for large-scale problems. Moreover, depending on a non-Python library complicates deployment.

An alternative to Spark is the Dask library. Dask is implemented in pure Python and integrates well with the scientific/numeric/HPC Python ecosystem. It works on single workstations as well as on clusters. A Dask backend would make it easier to debug and profile Drudge code without sacrificing performance. For the implementation, Dask collections such as dask.bag (as a replacement for the Spark RDD) or dask.delayed may be a good place to start.
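To illustrate the correspondence, here is a minimal sketch (assuming only that the dask package is installed; the data and operations are made up for illustration and are not taken from the Drudge codebase) of how dask.bag mirrors the lazy map/filter style of a Spark RDD:

```python
import dask.bag as db

# Build a bag from a plain Python sequence, split into partitions,
# much like sc.parallelize(...) would build an RDD in PySpark.
bag = db.from_sequence(range(10), npartitions=2)

# Lazy transformations, analogous to RDD.filter / RDD.map.
# Nothing is executed yet; Dask only records the task graph.
squares_of_evens = bag.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Trigger execution, analogous to RDD.collect().
result = squares_of_evens.compute()
print(result)  # [0, 4, 16, 36, 64]
```

Because the whole pipeline is pure Python, a crash inside a transformation surfaces as an ordinary Python traceback rather than a Scala/Java stack trace, which is exactly the debugging benefit discussed above.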

Gaurav and I have talked about switching from Spark to Dask. We may start experimenting when time allows.

@chenpeizhi chenpeizhi self-assigned this Jul 23, 2021