The current version of Drudge uses Spark (PySpark) for parallelism. Despite the computational speed-up it brings, the dependence on Spark adds an extra layer of complexity to developing, maintaining, and using Drudge. A typical problem during Drudge development is that the program crashes inside Spark code with a Scala/Java stack trace, which is hard for a Python developer to interpret. Our current workaround is to use `dummy_spark` for debugging and `pyspark` for production. However, this two-step approach turns out to be impractical for large-scale problems. Moreover, a non-Python dependency complicates deployment.
An alternative to Spark is the Dask library. Dask is implemented in pure Python and integrates well with other scientific/numeric/HPC Python libraries. It works on single workstations as well as clusters. A Dask backend would make it easier to debug and profile Drudge codes without sacrificing performance. For the implementation, Dask collections such as `dask.bag` (as a replacement for the Spark RDD) or `dask.delayed` may be a good place to start.
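To give a rough idea of why `dask.bag` is a natural RDD replacement, here is a minimal sketch of a Spark-style map/filter/aggregate pipeline expressed with it. The toy data and lambdas are purely illustrative, not actual Drudge code; the `# cf.` comments note the roughly corresponding RDD calls.

```python
import dask.bag as db

# Spark-style pipeline on a partitioned collection; like an RDD, a bag is
# evaluated lazily until .compute() is called.
terms = db.from_sequence(range(10), npartitions=2)  # cf. sc.parallelize(...)
result = (
    terms.map(lambda x: x * x)          # cf. rdd.map(...)
         .filter(lambda x: x % 2 == 0)  # cf. rdd.filter(...)
         .sum()                         # cf. rdd.sum(); still lazy here
         .compute()                     # triggers execution
)
print(result)  # 120
```

Because the bag API mirrors the RDD API this closely, much of Drudge's term-stream manipulation could plausibly be ported call-for-call, while debugging stays in pure Python (e.g. via Dask's synchronous single-threaded scheduler) instead of requiring a `dummy_spark` swap.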
Gaurav and I have talked about switching from Spark to Dask. We may start experimenting when time allows.