This is a summary of a quick performance study of rrdesi_mpi. From the profile (see image below), the computational bottleneck appears to be in calc_zchi2/calc_zchi2_one, at the dot product between a sparse resolution matrix and a spectral template here. For GPU offloading, it might be beneficial to "stack" or "batch" the operations in calc_zchi2/calc_zchi2_one, but that would require deeper investigation to understand how targets/templates/redshifts are distributed among MPI ranks.
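To illustrate the batching idea, here is a minimal sketch (assuming scipy/numpy; the matrix shapes and band structure are illustrative, not redrock's actual sizes) of how stacking templates turns many sparse matrix-vector products into a single sparse-dense matrix-matrix product, which is generally a better shape for GPU offload:

```python
import numpy as np
import scipy.sparse

nwave = 5000   # illustrative number of wavelength bins
nz = 50        # illustrative number of trial redshifts

# Banded sparse resolution matrix, standing in for redrock's R
rng = np.random.default_rng(0)
diags = rng.random((11, nwave))
R = scipy.sparse.dia_matrix((diags, np.arange(-5, 6)),
                            shape=(nwave, nwave)).tocsr()

# One spectral template per trial redshift
templates = rng.random((nz, nwave))

# Current pattern: one sparse mat-vec per redshift (loop in calc_zchi2_one)
per_z = np.array([R.dot(t) for t in templates])

# Batched pattern: a single sparse-dense mat-mat product
batched = R.dot(templates.T).T

assert np.allclose(per_z, batched)
```

Whether this restructuring is feasible depends on how the targets/templates/redshifts loop is distributed across MPI ranks, which is the open question noted above.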
Single Cori Haswell node performance:
#- DESI environment used for cascades, swapping out your own copy of redrock
source /global/cfs/cdirs/desi/software/desi_environment.sh 21.2
module unload redrock
git clone https://github.com/desihub/redrock
export PYTHONPATH=$(pwd)/redrock/py:$PYTHONPATH
export PATH=$(pwd)/redrock/bin:$PATH
export RR_TEMPLATE_DIR=/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/redrock-templates/0.7.2
#- Run redrock with 32x MPI parallelism on an interactive node (~3m20s)
salloc -N 1 -C haswell -q interactive -t 1:00:00
export OMP_NUM_THREADS=1
cd /global/cfs/cdirs/desi/spectro/redux/cascades/tiles/80605/20201215/
time srun -n 32 -c 2 rrdesi_mpi spectra-0-80605-20201215.fits -o $SCRATCH/redrock-0-80605-20201215.h5 -z $SCRATCH/zbest-0-80605-20201215.fits
...
Computing redshifts took: 147.6 seconds
Writing zscan data took: 1.1 seconds
Writing zbest data took: 24.9 seconds
Total run time: 191.2 seconds
real 3m18.792s
user 0m0.071s
sys 0m0.037s
Use python -m cProfile ... with a single MPI rank to generate a profile:
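For reference, the standard-library pattern for capturing and inspecting such a profile looks like this (a generic sketch; `hotspot` is a stand-in workload, not a redrock function):

```python
import cProfile
import io
import pstats

def hotspot():
    # Stand-in for the expensive dot-product loop being profiled
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hotspot()
profiler.disable()

# Sort by cumulative time to surface the bottleneck functions
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

With a real run, replacing the in-process calls with `python -m cProfile -o out.prof ...` and loading `out.prof` via `pstats.Stats("out.prof")` gives the same report.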
Half DGX node performance: