Parallel make -j builds fail with nvcc error : 'cudafe++' died due to signal 9 (Kill signal) #639

Open
valassi opened this issue Apr 12, 2023 · 1 comment

valassi commented Apr 12, 2023

I am doing a routine build and test (for MR #638), with my tmad allTees script. This includes a make -j of all processes.

Up until now, on all machines and with all compiler combinations I had used, this had always succeeded. (The only situation where the build failed was gg_ttgggg in PR #601, but that is an experimental case that clearly needs fixing by splitting up a function into smaller functions.)

Anyway, today I am getting the following errors:

...
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu testmisc.cc -o build.none_m_inl0_hrd0/testmisc_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c runTest.cc -o build.none_m_inl0_hrd0/runTest.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu fsampler.cc -o build.512y_m_inl0_hrd0/fsampler_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=skylake-avx512 -mprefer-vector-width=256  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c fsampler.cc -o build.512y_m_inl0_hrd0/fsampler.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu runTest.cc -o build.none_m_inl0_hrd0/runTest_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/gfortran -I. -c fcheck_sa.f -o build.none_m_inl0_hrd0/fcheck_sa.o
ccache /usr/local/cuda-12.0/bin/nvcc  -O3  -lineinfo -I. -I../../src -I../../../../../tools -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17  -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -Xcompiler -fPIC -c -x cu fsampler.cc -o build.none_m_inl0_hrd0/fsampler_cu.o
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -I/usr/local/cuda-12.0/include/ -fPIC -c fsampler.cc -o build.none_m_inl0_hrd0/fsampler.o
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
nvcc error   : 'cudafe++' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:422: build.512y_m_inl0_hrd0/gRandomNumberKernels.o] Error 9
...
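To compare the number of killed processes against the make parallelism, the kill messages can be counted from a saved copy of the build output. This is a minimal sketch, assuming the output was redirected to a file; the name `build.log` and the helper `count_sigkill` are hypothetical and not part of the repository.

```shell
#!/bin/sh
# count_sigkill LOGFILE (hypothetical helper, not part of the repository)
# Prints the number of lines in LOGFILE that report a tool killed by signal 9,
# matching the nvcc message text seen in the log above.
count_sigkill() {
  grep -c "died due to signal 9" "$1"
}

# Example use, assuming the build output was saved to build.log:
#   count_sigkill build.log
```

Note that `grep -c` prints 0 and returns a non-zero exit status when nothing matches, which matters if the script runs under `set -e`.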

This is on itscrd80 with CUDA 12.0 and gcc 11.2.

The only things that I can think of as being different from usual are:

  • This is a new PR, so the code changed slightly, but it is mainly the Fortran that changed, not the CUDA.
  • I am using itscrd80 because I have a driver issue on itscrd90 (and in the past I was usually on itscrd70).
  • I am only running tmad allTees rather than going via tput allTees first, but the builds should be exactly the same.

Maybe itscrd80 is configured differently?

Anyway, note that there are 9 nvcc errors, so I guess this was a build with parallelism 9? I will try to limit it to 5: I have 5 AVX builds in parallel, so anything lower would serialize too much.
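Signal 9 is SIGKILL. One common cause in heavily parallel builds is the kernel killing processes when memory runs out, but nothing in this report confirms that this is what happened here. Under that assumption, a job count could be derived from available memory instead of being hard-coded. The sketch below is illustrative only: `pick_jobs` is a hypothetical helper, and the 4 GiB per-job figure is a guess, not a measured footprint of nvcc or cudafe++.

```shell
#!/bin/sh
# pick_jobs AVAIL_KB PER_JOB_KB MAX_JOBS (hypothetical helper)
# Prints the largest job count n, with 1 <= n <= MAX_JOBS, such that
# n * PER_JOB_KB fits into AVAIL_KB (or 1 if even a single job does not fit).
pick_jobs() {
  avail_kb=$1
  per_job_kb=$2
  max_jobs=$3
  n=$(( avail_kb / per_job_kb ))
  if [ "$n" -lt 1 ]; then n=1; fi
  if [ "$n" -gt "$max_jobs" ]; then n=$max_jobs; fi
  echo "$n"
}

# Example use on Linux, where /proc/meminfo reports MemAvailable in kB.
# 4194304 kB (4 GiB) per compile job is an assumed figure, not a measurement:
#   avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
#   make -j"$(pick_jobs "$avail_kb" 4194304 "$(nproc)")"
```

With 16 GiB available and the assumed 4 GiB per job, this yields `-j4` even if the machine has more cores. The fixed `make -j5` proposed above is the simpler alternative when the per-job memory use is unknown.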

valassi commented Apr 12, 2023

Note that this is in gg_ttgg (I had already disabled gg_ttggg, as it failed in a previous build).

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 13, 2023