Parallel make -j builds fail with nvcc error : 'cudafe++' died due to signal 9 (Kill signal) #639

valassi · 2023-04-12T07:04:59Z

I am doing a routine build and test (for MR #638), with my tmad allTees script. This includes a make -j of all processes.

Up until now, on all machines and with all compiler combinations I had used, this had always succeeded. (The only situation where the build failed is gg_ttgggg in PR #601, but this is an experimental case that clearly needs fixing by splitting up a function into smaller functions.)

Anyway, today I am getting the following errors:

...
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu testmisc.cc -o build.none_m_inl0_hrd0/testmisc_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c runTest.cc -o build.none_m_inl0_hrd0/runTest.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu fsampler.cc -o build.512y_m_inl0_hrd0/fsampler_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=skylake-avx512 -mprefer-vector-width=256  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c fsampler.cc -o build.512y_m_inl0_hrd0/fsampler.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu runTest.cc -o build.none_m_inl0_hrd0/runTest_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/gfortran -I. -c fcheck_sa.f -o build.none_m_inl0_hrd0/fcheck_sa.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu fsampler.cc -o build.none_m_inl0_hrd0/fsampler_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c fsampler.cc -o build.none_m_inl0_hrd0/fsampler.o
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:422: build.512y_m_inl0_hrd0/gRandomNumberKernels.o] Error 9
...

This is on itscrd80 with cuda 12.0 and gcc11.2.

The only things that I can think of as being different from usual are:

- of course this is a new PR so the code changed slightly, but actually it is mainly the Fortran that changed, not the cuda
- I am using itscrd80 because I have a driver issue on itscrd90 (and in the past I was on itscrd70 usually)
- I am only running tmad allTees rather than going via tput allTees first, but the builds should be exactly the same

Maybe itscrd80 is configured differently?

Anyway, note that there are 9 nvcc errors, so I guess this was a build with parallelism 9? I will try to limit it (to 5 as I have 5 AVX builds in parallel, otherwise it serializes too much...)
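
Signal 9 is SIGKILL. On Linux one common source of it is the kernel's out-of-memory killer, although nothing in this report confirms that this was the cause here. Under that assumption, a minimal sketch of capping the job count by memory as well as by cores could look as follows. The helper name `pick_jobs` and the figure of 3 GiB per nvcc job are hypothetical, not measured values.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): choose a make -j value
# that is bounded both by the core count and by available memory.
# Arguments: <cores> <available memory in GiB> <assumed GiB per compile job>
pick_jobs() {
  pj_cores=$1
  pj_mem_gib=$2
  pj_per_job_gib=$3
  # Integer division: how many jobs fit in memory under the assumed footprint.
  pj_by_mem=$(( pj_mem_gib / pj_per_job_gib ))
  # Always allow at least one job.
  if [ "$pj_by_mem" -lt 1 ]; then
    pj_by_mem=1
  fi
  # Take the smaller of the two limits.
  if [ "$pj_by_mem" -lt "$pj_cores" ]; then
    echo "$pj_by_mem"
  else
    echo "$pj_cores"
  fi
}

# Example: 8 cores, 16 GiB available, 3 GiB assumed per job.
pick_jobs 8 16 3   # prints 5
```

With such a helper the build could be started as `make -j"$(pick_jobs "$(nproc)" 16 3)"`, where 16 and 3 are placeholders to be replaced by values observed on the build machine.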

valassi · 2023-04-12T07:13:43Z

Note, this is in gg_ttgg (I had already disabled gg_ttggg as it failed in a previous build)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 13, 2023

[vecsizeFIX] use make -j5 to limit build parallelism to avoid "cuda…

b26aabd

…fe++ died due to signal 9" (madgraph5#639)
