This is a summary of a quick performance study of rrdesi_mpi. From the profile (see image below), the computational bottleneck appears to be in calc_zchi2/calc_zchi2_one, at the dot product between a sparse resolution matrix and a spectral template here. For GPU offloading, it might be beneficial to "stack" or "batch" the operations in calc_zchi2/calc_zchi2_one, but that would require deeper investigation to understand how targets/templates/redshifts are distributed among MPI ranks.
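To illustrate the batching idea, here is a minimal sketch (assuming scipy/numpy; the matrix shapes and band structure are illustrative, not redrock's actual sizes) of how stacking templates turns many sparse matrix-vector products into a single sparse-dense matrix-matrix product, which is generally a better shape for GPU offload:

```python
import numpy as np
import scipy.sparse

nwave = 5000   # illustrative number of wavelength bins
nz = 50        # illustrative number of trial redshifts

# Banded sparse resolution matrix, standing in for redrock's R
rng = np.random.default_rng(0)
diags = rng.random((11, nwave))
R = scipy.sparse.dia_matrix((diags, np.arange(-5, 6)),
                            shape=(nwave, nwave)).tocsr()

# One spectral template per trial redshift
templates = rng.random((nz, nwave))

# Current pattern: one sparse mat-vec per redshift (loop in calc_zchi2_one)
per_z = np.array([R.dot(t) for t in templates])

# Batched pattern: a single sparse-dense mat-mat product
batched = R.dot(templates.T).T

assert np.allclose(per_z, batched)
```

Whether this restructuring is feasible depends on how the targets/templates/redshifts loop is distributed across MPI ranks, which is the open question noted above.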
Single Cori Haswell node performance:
#- DESI environment used for cascades, swapping out your own copy of redrock
source /global/cfs/cdirs/desi/software/desi_environment.sh 21.2
module unload redrock
git clone https://github.com/desihub/redrock
export PYTHONPATH=$(pwd)/redrock/py:$PYTHONPATH
export PATH=$(pwd)/redrock/bin:$PATH
export RR_TEMPLATE_DIR=/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/code/redrock-templates/0.7.2
#- Run redrock with 32x MPI parallelism on an interactive node (~3m20s)
salloc -N 1 -C haswell -q interactive -t 1:00:00
export OMP_NUM_THREADS=1
cd /global/cfs/cdirs/desi/spectro/redux/cascades/tiles/80605/20201215/
time srun -n 32 -c 2 rrdesi_mpi spectra-0-80605-20201215.fits -o $SCRATCH/redrock-0-80605-20201215.h5 -z $SCRATCH/zbest-0-80605-20201215.fits
...
Computing redshifts took: 147.6 seconds
Writing zscan data took: 1.1 seconds
Writing zbest data took: 24.9 seconds
Total run time: 191.2 seconds
real 3m18.792s
user 0m0.071s
sys 0m0.037s
Use python -m cProfile ... with a single MPI rank to generate a profile:
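For reference, the standard-library pattern for capturing and inspecting such a profile looks like this (a generic sketch; `hotspot` is a stand-in workload, not a redrock function):

```python
import cProfile
import io
import pstats

def hotspot():
    # Stand-in for the expensive dot-product loop being profiled
    total = 0
    for i in range(100_000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
hotspot()
profiler.disable()

# Sort by cumulative time to surface the bottleneck functions
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

With a real run, replacing the in-process calls with `python -m cProfile -o out.prof ...` and loading `out.prof` via `pstats.Stats("out.prof")` gives the same report.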
Half DGX node performance: