The project addresses the impedance mismatch between data-intensive and compute-intensive ecosystems by extending the Spark platform with an MPI-based inter-worker communication model for supporting HPC applications. The rationale and a general description are provided in the arXiv paper, the NYSDS paper, and the Spark Summit East '17 talk (located in the doc directory):
- Spark-MPI: Approaching the Fifth Paradigm of Cognitive Applications, arXiv:1806.01110, May 16, 2018
- Building Near-Real-Time Processing Pipelines with the Spark-MPI platform, NYSDS, New York, August 7-9, 2017
- Bringing HPC Algorithms to Big Data Platforms, Spark Summit East, Boston, February 7-9, 2017
The Spark-MPI approach is illustrated by a conceptual demo (located in the examples/spark-mpi directory) that runs the MPI Allreduce method on the Spark workers; a simplified sketch of this pattern is shown below.
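The following is a minimal sketch (not the bundled demo itself) of the pattern the demo exercises: each Spark partition joins an MPI communicator via mpi4py and takes part in an Allreduce. The PMI-style handshake (the PMI_PORT/PMI_ID environment variables and their values) and the helper name allreduce_partition are illustrative assumptions; see examples/spark-mpi for the actual demo.

```python
# Conceptual sketch: run an MPI Allreduce inside Spark workers with mpi4py.
# Assumption: a PMI process manager is already running and its address is
# known to the driver; the PMI_PORT/PMI_ID variables below are illustrative.
import os
from pyspark.sql import SparkSession

def allreduce_partition(index, iterator):
    # Hypothetical handshake: point the MPI runtime at the process manager
    # before MPI is initialized inside the Spark worker.
    os.environ["PMI_PORT"] = os.environ.get("PMI_PORT", "<hostname>:<port>")
    os.environ["PMI_ID"] = str(index)

    from mpi4py import MPI                     # MPI is initialized on import
    comm = MPI.COMM_WORLD

    local = sum(iterator)                      # partial sum of this partition
    total = comm.allreduce(local, op=MPI.SUM)  # global sum across all workers
    return [(comm.Get_rank(), total)]

spark = SparkSession.builder.appName("spark-mpi-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.mapPartitionsWithIndex(allreduce_partition).collect())
```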
The platform builds on the following prerequisites:
- Anaconda3-4.2.0 with Python 3.5 (note: Spark 2.1 does not support Python 3.6)
install Anaconda
conda install libgcc
- Spark 2.1
download Spark
cd spark
export PYSPARK_PYTHON=python3
./build/mvn -DskipTests clean package
- MPI library, for example Open MPI
download and unpack Open MPI, then in its source directory:
./configure --prefix=<installation directory> --with-cuda --with-libevent=external
make
make install
- MPI Python wrapper, for example mpi4py 3.0 (a quick import check is sketched after this list)
wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.0.tar.gz
tar -zxf mpi4py-3.0.0.tar.gz
cd mpi4py-3.0.0
python setup.py build
python setup.py install
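As a sanity check (not part of the original instructions), the freshly installed wrapper can be verified from Python; the snippet below only assumes that mpi4py is importable and was built against the Open MPI installation above.

```python
# Optional sanity check: confirm that mpi4py imports and reports the MPI library
# it was built against (it should mention the Open MPI installation from above).
from mpi4py import MPI

print(MPI.Get_library_version())   # MPI library identification string
print(MPI.COMM_WORLD.Get_size())   # 1 when run outside of mpirun
```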
After installing the prerequisites, build and install the Spark-MPI library:
export MPI_SRC=<Open MPI build directory>
git clone https://github.com/SciDriver/spark-mpi.git
mkdir build
cd build
cmake ../spark-mpi -DCMAKE_INSTALL_PREFIX=<installation directory>
make
sudo make install