
MPI_Cart_sub segfault #13081

Open · mwiesenberger opened this issue Feb 5, 2025 · 4 comments

@mwiesenberger

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

mpirun --version reports 4.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I run Linux Mint 21.3 (v6.0.4) and used the operating system package manager: apt install libopenmpi-dev

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status. Not applicable (installed from the distribution package).

Please describe the system on which you are running

  • Operating system/version: Linux Mint 21.3 (v6.0.4)
  • Computer hardware: Intel Xeon W-2133 CPU
  • Network type: n/a
  • Compiler: mpic++ --version reports g++ 11.4.0

Details of the problem

This program segfaults:

// segfault.cpp
#include <iostream>
#include <cassert>
#include <mpi.h>

int main( int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank);
    MPI_Comm_size( MPI_COMM_WORLD, &size);

    // Split off the even ranks; odd ranks pass MPI_UNDEFINED and get
    // MPI_COMM_NULL. comm_split is never used again, but creating it is
    // part of the reproducer.
    MPI_Comm comm = MPI_COMM_WORLD;
    int reduce_rank = rank % 2;
    int color = reduce_rank == 0 ? 1 : MPI_UNDEFINED;
    MPI_Comm comm_split;
    MPI_Comm_split( comm, color, 0, &comm_split);

    // Create a periodic 2d Cartesian grid and keep only dimension 0.
    MPI_Comm comm2;
    int dims[2] = {0,0};
    int periods[2] = {1,1};
    assert( MPI_Dims_create( size, 2, dims) == MPI_SUCCESS);
    assert( MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, true, &comm2) == MPI_SUCCESS);
    int remains[2] = {1,0};
    MPI_Comm comm2_01;
    assert( comm2 != MPI_COMM_NULL);
    assert( MPI_Cart_sub( comm2, remains, &comm2_01) == MPI_SUCCESS); // Segfault here

    MPI_Finalize();
}
shell$ mpic++ segfault.cpp -o segfault -g
shell$ mpirun -n 2 ./segfault
[titanxp:09308] *** Process received signal ***
[titanxp:09308] Signal: Segmentation fault (11)
[titanxp:09308] Signal code: Address not mapped (1)
[titanxp:09308] Failing at address: 0x8
[titanxp:09308] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fa3ee5ed520]
[titanxp:09308] [ 1] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x995)[0x7fa3cac121d5]
[titanxp:09308] [ 2] /usr/local/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x324)[0x7fa3d9406af4]
[titanxp:09308] [ 3] /usr/local/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fa3ede36c2c]
[titanxp:09308] [ 4] /usr/local/lib/libmpi.so.40(ompi_request_default_wait+0x4d)[0x7fa3eea4b95d]
[titanxp:09308] [ 5] /usr/local/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xc1)[0x7fa3eeaa85d1]
[titanxp:09308] [ 6] /usr/local/lib/libmpi.so.40(ompi_coll_base_allgather_intra_two_procs+0x89)[0x7fa3eeaa77c9]
[titanxp:09308] [ 7] /usr/local/lib/libmpi.so.40(ompi_comm_split+0xc5)[0x7fa3eea2eba5]
[titanxp:09308] [ 8] /usr/local/lib/libmpi.so.40(mca_topo_base_cart_sub+0xe4)[0x7fa3eead0054]
[titanxp:09308] [ 9] /usr/local/lib/libmpi.so.40(PMPI_Cart_sub+0xca)[0x7fa3eea6805a]
[titanxp:09308] [10] ./segfault(+0x1469)[0x560e774d9469]
[titanxp:09308] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fa3ee5d4d90]
[titanxp:09308] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fa3ee5d4e40]
[titanxp:09308] [13] ./segfault(+0x11e5)[0x560e774d91e5]
[titanxp:09308] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titanxp exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
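Until the installation can be upgraded, a possible workaround is to build the same sub-communicators by hand: MPI_Cart_sub with remains = {1,0} groups processes that share a coordinate in the dropped dimension 1, ordered by their coordinate in the kept dimension 0, and the same grouping can be produced with MPI_Cart_coords plus MPI_Comm_split. A minimal sketch (hypothetical, not from the report; the file name and the final MPI_Comm_free calls are mine):

// cart_sub_workaround.cpp -- hypothetical sketch, not part of the report
#include <mpi.h>

int main( int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size( MPI_COMM_WORLD, &size);

    int dims[2] = {0,0};
    int periods[2] = {1,1};
    MPI_Dims_create( size, 2, dims);

    MPI_Comm comm2;
    MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, true, &comm2);

    // Equivalent grouping to MPI_Cart_sub( comm2, {1,0}, ...): processes
    // with the same coordinate in the dropped dimension 1 share a
    // sub-communicator, ranked by their coordinate in the kept dimension 0.
    int cart_rank, coords[2];
    MPI_Comm_rank( comm2, &cart_rank);
    MPI_Cart_coords( comm2, cart_rank, 2, coords);

    MPI_Comm comm2_01;
    MPI_Comm_split( comm2, coords[1], coords[0], &comm2_01);

    MPI_Comm_free( &comm2_01);
    MPI_Comm_free( &comm2);
    MPI_Finalize();
}

Note the caveat that MPI_Comm_split returns a communicator without the Cartesian topology that MPI_Cart_sub would attach, so this only helps if the sub-communicator is used for plain point-to-point and collective calls.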
@ggouaillardet
Contributor

FWIW, I was not able to reproduce the issue, but I only tried with more recent versions of Open MPI.

Out of curiosity, can you try

mpirun -np 2 --mca btl ^smcuda ./segfault
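
(the ^ prefix in the MCA selection syntax means "exclude", i.e. run with every BTL component except smcuda; shared-memory traffic then falls back to another component such as vader)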

@mwiesenberger
Author

mwiesenberger commented Feb 5, 2025

Yeah, the solution probably is just to update Open MPI. It would be good to know which version fixes the issue, though.

shell$ mpirun -n 2 --mca btl ^smcuda ./segfault
[titanxp:13818] *** Process received signal ***
[titanxp:13818] Signal: Segmentation fault (11)
[titanxp:13818] Signal code: Address not mapped (1)
[titanxp:13818] Failing at address: 0x8
[titanxp:13818] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fdf4b819520]
[titanxp:13818] [ 1] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x995)[0x7fdf308121d5]
[titanxp:13818] [ 2] /usr/local/lib/openmpi/mca_btl_vader.so(mca_btl_vader_poll_handle_frag+0x8f)[0x7fdf36c0437f]
[titanxp:13818] [ 3] /usr/local/lib/openmpi/mca_btl_vader.so(+0x46a7)[0x7fdf36c046a7]
[titanxp:13818] [ 4] /usr/local/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fdf4b036c2c]
[titanxp:13818] [ 5] /usr/local/lib/libmpi.so.40(ompi_request_default_wait+0x4d)[0x7fdf4ba4b95d]
[titanxp:13818] [ 6] /usr/local/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xc1)[0x7fdf4baa85d1]
[titanxp:13818] [ 7] /usr/local/lib/libmpi.so.40(ompi_coll_base_allgather_intra_two_procs+0x89)[0x7fdf4baa77c9]
[titanxp:13818] [ 8] /usr/local/lib/libmpi.so.40(ompi_comm_split+0xc5)[0x7fdf4ba2eba5]
[titanxp:13818] [ 9] /usr/local/lib/libmpi.so.40(mca_topo_base_cart_sub+0xe4)[0x7fdf4bad0054]
[titanxp:13818] [10] /usr/local/lib/libmpi.so.40(PMPI_Cart_sub+0xca)[0x7fdf4ba6805a]
[titanxp:13818] [11] ./segfault(+0x1455)[0x55f545ab0455]
[titanxp:13818] [12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdf4b800d90]
[titanxp:13818] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdf4b800e40]
[titanxp:13818] [14] ./segfault(+0x11a5)[0x55f545ab01a5]
[titanxp:13818] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node titanxp exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
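
So excluding smcuda does not help; the backtrace is essentially the same, only going through the vader BTL instead.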

@bosilca
Member

bosilca commented Feb 5, 2025

I can't reproduce with any of the official stable releases (4.1 or 5.0). Main also works.

@ggouaillardet
Contributor

4.0.0 crashes (if I force --mca pml ob1 --mca btl tcp,self)
4.0.1 works
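
So, assuming the regression is confined to 4.0.0, any release from 4.0.1 on (including the 4.1 and 5.0 series tested above) should make the reproducer pass. A quick sanity check after upgrading:

shell$ mpirun --version    # should report 4.0.1 or later
shell$ mpirun -n 2 ./segfault && echo OK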
