Segfault in dnrformat_loc3d.c #129

Open
jeanlucf22 opened this issue Nov 23, 2022 · 15 comments

Comments

@jeanlucf22
Contributor

I am running into a runtime issue while trying to use the pdgssvx3d() solver on a single GPU (1 MPI task). My code follows the PDDRIVE3D example. A standalone driver seems to work, but once the solver is integrated into the real application, I get a segfault. It seems related to the redistribution of the solution vector, which should be trivial in my case since I have only one MPI task. Here is the end of my output:

....
.. B to X redistribute time 0.0023
.. Setup L-solve time 0.0001
.. L-solve time 0.0025
.. L-solve time (MAX) 0.0025
.. Setup U-solve time 0.0005
.. U-solve time 0.0032
.. U-solve time (MAX) 0.0032
.. X to B redistribute time 0.0003
( 0) .. Step 1: berr[j] = 3.091400e-16
.. GPU trisolve
num_thread: 1
.. B to X redistribute time 0.0022
.. Setup L-solve time 0.0001
.. L-solve time 0.0026
.. L-solve time (MAX) 0.0026
.. Setup U-solve time 0.0005
.. U-solve time 0.0032
.. U-solve time (MAX) 0.0032
.. X to B redistribute time 0.0003
( 0) .. Step 2: berr[j] = 3.852047e-16
.. DiagScale = 1
[narsil-gpu2:1555296:0:1555296] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14a6bd895c80)
==== backtrace (tid:1555296) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x00000000000cb86a matCopy() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:36
2 0x00000000000cd175 dScatter_B3d() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:570
3 0x00000000000cb7cd pdgssvx3d() /home/q8j/GIT/superlu_dist/SRC/pdgssvx3d.c:1699

@xiaoyeli
Owner

It dies in iterative refinement, after 2 iterations. Can you turn off refinement?
In fact, your first solution already looks good -- backward error 'berr' is small:
( 0) .. Step 1: berr[j] = 3.091400e-16

@xiaoyeli
Owner

Sorry, it seems it passes iterative refinement (you can still safely turn it off.)
It probably dies in the final redistribution of the solution vector X.
Which version (or commit) are you using? The reported line numbers do not match the master branch.

@jeanlucf22
Contributor Author

Sorry, I was using an older commit. Here is the error call stack with the latest master commit:

[narsil-gpu2:1602573:0:1602573] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14ab09895c80)
==== backtrace (tid:1602573) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x00000000000c6533 matCopy() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:34
2 0x00000000000c7dd8 dScatter_B3d() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:565
3 0x00000000000c6496 pdgssvx3d() /home/q8j/GIT/superlu_dist/SRC/pdgssvx3d.c:1713

@xiaoyeli
Owner

Hard to tell what may be wrong. In this line:
2 0x00000000000c7dd8 dScatter_B3d() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:565
matCopy(A3d->m_loc, nrhs, B, ldb, Btmp, A3d->m_loc);

Can you do some detective work, printing nrhs, ldb, A3d->m_loc, and the addresses of B and Btmp?
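
As an illustration of the diagnostics requested above, here is a minimal standalone sketch. It is not the SuperLU_DIST source: `mat_copy_sketch` only mirrors the loop body visible in the backtrace (`Dst[i + lddst * j] = Src[i + ldsrc * j]`), and `print_copy_args` is a hypothetical helper name introduced for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Column-major copy of an n-by-m block. Sketch only: it follows the loop
// body shown in the backtrace, not the actual library implementation.
void mat_copy_sketch(int n, int m, double* dst, int lddst,
                     const double* src, int ldsrc)
{
    assert(lddst >= n && ldsrc >= n);  // each leading dimension must cover a column
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            dst[i + lddst * j] = src[i + ldsrc * j];
}

// Hypothetical helper: print the sizes and both addresses before the copy runs.
void print_copy_args(int m_loc, int nrhs, int ldb,
                     const double* B, const double* Btmp)
{
    std::fprintf(stderr, "m_loc=%d nrhs=%d ldb=%d B=%p Btmp=%p\n",
                 m_loc, nrhs, ldb,
                 static_cast<const void*>(B),
                 static_cast<const void*>(Btmp));
}
```

Calling the print helper immediately before the copy would show whether the destination address is already suspicious at that point, or whether the sizes are inconsistent with the allocation.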

@jeanlucf22
Contributor Author

This is what cuda-gdb gives me:

signal SIGSEGV, Segmentation fault.
0x000015552097d533 in matCopy (n=38315, m=1, Dst=0x1553f7895c80, lddst=38315,
Src=0x2a368b70, ldsrc=38315)
at /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:34
34 Dst[i + lddst * j] = Src[i + ldsrc * j];
(cuda-gdb) where
#0 0x000015552097d533 in matCopy (n=38315, m=1, Dst=0x1553f7895c80,
lddst=38315, Src=0x2a368b70, ldsrc=38315)
at /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:34
#1 0x000015552097edd8 in dScatter_B3d (A3d=0x39ecee30, grid3d=0x39180bf8)
at /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:565
#2 0x000015552097d496 in pdgssvx3d (options=0x39180c80, A=0x39180b88,
ScalePermstruct=0x7fffffff0c70, B=0x3a8dac90, ldb=38315, nrhs=1,
grid3d=0x39180bf8, LUstruct=0x7fffffff0c50, SOLVEstruct=0x7fffffff0ca0,
berr=0x39f5a8a0, stat=0x7fffffff0ba0, info=0x7fffffff0cfc)
at /home/q8j/GIT/superlu_dist/SRC/pdgssvx3d.c:1713

@xiaoyeli
Owner

The numbers all look correct to me. It dies at line 34, in the loop. Do you know what "i" is at that point?

@jeanlucf22
Contributor Author

i=0, j=0, Src[0]=-7.5099551745997222e-13,

(cuda-gdb) print Dst[0]
Cannot access memory at address 0x1553f7895c80

@xiaoyeli
Owner

The copy destination 'Dst' is just the right-hand side of the original linear system. Something in your program seems to have interfered with it.
Can you run the standalone pddrive3d(), set a breakpoint here, and see whether you get a valid address?

@jeanlucf22
Contributor Author

After further investigation, it looks like the issue is related to the interaction between SuperLU and Kokkos or Trilinos, which my application is using. A standalone driver runs fine.
If I copy my Trilinos vector into a temporary dynamically allocated array and pass that to the SuperLU solver, the solver completes without error. But it seems to corrupt my Trilinos vector (actually a Tpetra::MultiVector): an attempt at printing its values afterwards leads to a segfault...
The mystery continues...
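
For readers following along, the staging-buffer workaround described above can be sketched roughly as follows. This is an assumption-laden illustration, not the application's code: `solve_in_place_stub` is a hypothetical stand-in for an in-place solver call (it only halves each entry), and `solve_via_staging` is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a solver that overwrites the right-hand side
// with the solution. Here it just halves each entry.
void solve_in_place_stub(double* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) b[i] *= 0.5;
}

// Copy the caller's data into a separately allocated buffer, solve there,
// and copy the result back, so the solver never sees the original pointer.
void solve_via_staging(double* x, std::size_t n)
{
    assert(x != nullptr || n == 0);
    std::vector<double> staging(x, x + n);                  // copy in
    solve_in_place_stub(staging.data(), n);                 // solver works on the copy
    for (std::size_t i = 0; i < n; ++i) x[i] = staging[i];  // copy out
}
```

The pattern isolates the solver from the original allocation. It does not by itself explain why the original vector ends up corrupted after the call, which is the open question in this thread.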

@xiaoyeli
Owner

Do you know how the vector Tpetra::MultiVector is stored? Perhaps it contains some structure. i.e., not a flat vector.

@jwillenbring
@egboman

@egboman

egboman commented Dec 20, 2022

Tpetra::MultiVector contains a 2D Kokkos::View with the actual data. I suggest getting the Kokkos::View and passing that to SuperLU. See https://docs.trilinos.org/dev/packages/tpetra/doc/html/classTpetra_1_1MultiVector.html

@jeanlucf22
Contributor Author

@egboman This is what I do. Here are a few details about the code that fails (in call to pdgssvx3d):

void SuperLUSolver::solve(const Tpetra::MultiVector<>& b,
                          Tpetra::MultiVector<>& x)
{
    // copy b into x
    Tpetra::deep_copy(x, b);

    // Get Kokkos Views
    x.clear_sync_state();
    auto x_view = xvec->getLocalViewHost(Tpetra::Access::OverwriteAllStruct());
    double* xv = x_view.data();

    ...
    pdgssvx3d(&_options, &_A, &ScalePermstruct, xv, _num_rows, 1, &_grid,
              &LUstruct, &SOLVEstruct, berr, &stat, &info);
    ...

@xiaoyeli
Owner

@jeanlucf22
Can you print out some entries of xv just before the call and right at the beginning of pdgssvx3d?

@egboman

egboman commented Dec 20, 2022

@jeanlucf22 Looks like you are doing things right on the Trilinos/Tpetra side. Sorry, not sure what the problem is here.
I'm curious: the Amesos2 package already supports SuperLU_DIST as a solver, so why do you need to do this? Are you using new 3D features in SuperLU_DIST that are not supported in Amesos2?

@jeanlucf22
Contributor Author

@xiaoyeli The values of xv printed before the SuperLU call look fine (I have not tried printing them inside SuperLU). I can even copy those values into a simple C++ array and pass that to SuperLU. That runs fine, but xv then appears corrupted when I try to access it after the SuperLU call, as if SuperLU were somehow messing up its allocation... (I don't understand how.)
@egboman I am indeed using a different solver not currently supported by Trilinos. I am also using it as a local solver (single MPI task) to precondition an outer solver, for which we already had the structure in place.
