Segfault in dnrformat_loc3d.c #129
It dies in iterative refinement, after 2 iterations. Can you turn off refinement?
Sorry, it seems it passes iterative refinement (you can still safely turn it off).
Sorry, I was using an older commit. Here is the error call stack with the latest master commit: [narsil-gpu2:1602573:0:1602573] Caught signal 11 (Segmentation fault: address not mapped to object at ad
Hard to tell what may be wrong. In this line: can you do some detective work, printing nrhs, ldb, A3d->m_loc, and the addresses of B and Btmp?
This is what cuda-gdb gives me: signal SIGSEGV, Segmentation fault.
The numbers all look correct to me. It dies at line 34, in the loop. Do you know what "i" is at that point?
i=0, j=0, Src[0]=-7.5099551745997222e-13, (cuda-gdb) print Dst[0]
The copy destination 'Dst' is just the right-hand side of the original linear system. Somehow your program is interfering with it.
After further investigation, it looks like the issue is related to the interaction between SuperLU and Kokkos or Trilinos, which my application is using. A standalone driver runs fine.
Do you know how a Tpetra::MultiVector is stored? Perhaps it contains some structure, i.e., it is not a flat vector.
Tpetra::MultiVector contains a 2D Kokkos::View with the actual data. I suggest you get the Kokkos::View and pass that to SuperLU. See https://docs.trilinos.org/dev/packages/tpetra/doc/html/classTpetra_1_1MultiVector.html
@egboman This is what I do. Here are a few details about the code that fails (in the call to pdgssvx3d): void SuperLUSolver::solve(const Tpetra::MultiVector<>& b,
... |
@jeanlucf22 |
@jeanlucf22 Looks like you are doing things right on the Trilinos/Tpetra side. Sorry, not sure what the problem is here. |
@xiaoyeli Printed-out values of xv before the superlu call look fine (I have not tried to do it inside superlu). I can even copy those values into a simple C++ array and pass it to superlu. That runs fine, but then it looks like xv is corrupted when I try to access it after the superlu call, as if superlu were somehow messing up its allocation (I don't understand how).
I am running into a runtime issue. I am trying to use the pdgssvx3d() solver on a single GPU (1 MPI task). The code I am using follows the example from PDDRIVE3D. I actually have a standalone driver that seems to work, but once integrated into the real application, I get a segfault. It seems related to the redistribution of the solution vector, which should be trivial in my case since I have only one MPI task. The end of my output is shown here.
....
.. B to X redistribute time 0.0023
.. Setup L-solve time 0.0001
.. L-solve time 0.0025
.. L-solve time (MAX) 0.0025
.. Setup U-solve time 0.0005
.. U-solve time 0.0032
.. U-solve time (MAX) 0.0032
.. X to B redistribute time 0.0003
( 0) .. Step 1: berr[j] = 3.091400e-16
.. GPU trisolve
num_thread: 1
.. B to X redistribute time 0.0022
.. Setup L-solve time 0.0001
.. L-solve time 0.0026
.. L-solve time (MAX) 0.0026
.. Setup U-solve time 0.0005
.. U-solve time 0.0032
.. U-solve time (MAX) 0.0032
.. X to B redistribute time 0.0003
( 0) .. Step 2: berr[j] = 3.852047e-16
.. DiagScale = 1
[narsil-gpu2:1555296:0:1555296] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14a6bd895c80)
==== backtrace (tid:1555296) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x00000000000cb86a matCopy() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:36
2 0x00000000000cd175 dScatter_B3d() /home/q8j/GIT/superlu_dist/SRC/dnrformat_loc3d.c:570
3 0x00000000000cb7cd pdgssvx3d() /home/q8j/GIT/superlu_dist/SRC/pdgssvx3d.c:1699