Segmentation fault using mpi GAP_fit #673
A suggestion that sometimes helps: have this set before you execute the command. Let us know if this helps.
Hi, thanks for the quick response. I already have this in my slurm submission script unfortunately.
Thanks. Is the snippet above the full output of your program? I.e. it stops after
Do you mean whether the segmentation fault is above the hello world? It seems to run all the way up until the QR decomposition, then crashes and returns the memory error.
Can you post the output, please? Thanks.
This is quite long but:
Sorry, I meant the bits after the hello world, when it collects the number of descriptors etc.
Oh sorry. Here's the first bit:
All I can think of, based on this, is that there is an additional chunk of memory which is not accounted for and becomes significant. It would be interesting to either run the same fit with e.g. half of the database, or use more nodes (while using more OpenMP threads as well). I know this is not super useful.
I've just tried running with just the 2b descriptor and it works fine. Increasing the number of nodes and changing the database size didn't seem to help unfortunately. I guess I'll have to trial and error to see what's going wrong. Thanks for the help, please let me know if you have any other suggestions.
Also, does the fact that the displayed available memory isn't consistent with the actual amount I am using indicate an error with the compilation? Or does GAP not show the TOTAL amount of memory from all nodes?
No, this simply reports the memory on the root node. I agree that the text is somewhat misleading.
What is the distribution of the sizes (number of atoms) of your configurations?
Each configuration has ca. 800 atoms. I'm guessing that I need to reduce the number of OMP threads to accommodate this?
I was worried that the distribution might be very uneven, in which case the task manager might run into problems.
No, in fact, you can use a higher value. Both the descriptor and covariance calculations can make use of that. There is no hard rule on how high; see the paper for our findings.
Okay, I will give it a try with more threads.
Any luck with solving this, @Ash-Dickson? I think I might have the same issue: I also get a segfault at a similar place. I have many elements and a LOT of descriptors (459 to be exact, mostly ...). Here is my backtrace; any clues from this, @albapa?
Hi @jesperbygg, I believe the issue for me was OMP_STACKSIZE being too low. I increased this and have since had no issues with my training (although, as you say, I am using considerably fewer descriptors and sparse points). I have found that I still run into segfault issues for larger amounts of training data, but increasing the number of nodes I'm using seems to sort this. I also found that using a fairly large number of OMP threads helped combat these issues. I'm sorry this isn't particularly helpful; best of luck with finding a solution!
No, that is very helpful, @Ash-Dickson, thank you very much. I will do some testing.
Thanks to both of you for these messages. The stack trace is super useful, I will have a look. Also, I didn't realise OMP_STACKSIZE would affect the training process, thanks for this insight.
@jesperbygg I am wondering if you could add a print statement before the current line 983 in
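A minimal sketch of that kind of diagnostic (a standalone program, not the QUIP source; the array and size names are placeholders, since the file in question isn't shown above): print the requested dimensions just before the allocate, and use stat=/errmsg= so a bad request produces a readable message instead of a crash.

```fortran
program alloc_diagnostic_sketch
  ! Standalone sketch: the array and size variables are placeholders for
  ! whatever is allocated at the line in question in the real code.
  use iso_fortran_env, only: int64, real64
  implicit none

  integer(int64) :: n_rows, n_cols
  integer :: ierr
  character(len=256) :: msg
  real(real64), allocatable :: a(:,:)

  n_rows = 2000_int64
  n_cols = 3000_int64

  ! The suggested diagnostic: report the requested shape (and rough memory)
  ! just before the allocate statement.
  write(*,*) 'about to allocate', n_rows, 'x', n_cols, 'array,', &
             real(n_rows, real64) * real(n_cols, real64) * 8.0_real64 / 1024.0_real64**3, 'GiB'

  ! stat= and errmsg= turn a hard crash into a readable error if the request
  ! itself is malformed or too large.
  allocate(a(n_rows, n_cols), stat=ierr, errmsg=msg)
  if (ierr /= 0) then
     write(*,*) 'allocation failed: ', trim(msg)
     stop 1
  end if

  write(*,*) 'allocation succeeded'
end program alloc_diagnostic_sketch
```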
Thanks for the suggestion, @albapa. Indeed, the problem is that the
Now I need to figure out how to let the allocatable array size be defined by larger integers. Any tips are welcome. I don't think declaring
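Assuming the failure really is a 32-bit overflow of the size expression (as the message above suggests), a minimal standalone sketch of the usual fix is below; it is not the QUIP code and the dimensions are made up. The allocatable declaration itself does not need to change, since allocate accepts 64-bit bounds; what has to change is that the size product is computed in 64-bit arithmetic, e.g. via iso_fortran_env's int64.

```fortran
program size_overflow_sketch
  ! Standalone sketch, assuming the crash comes from a size expression that
  ! overflows default (typically 32-bit) integers; dimensions are made up.
  use iso_fortran_env, only: int64
  implicit none

  integer :: n_sparse, n_data          ! default-kind integers
  integer(int64) :: n_bad, n_good

  n_sparse = 60000
  n_data   = 50000

  ! The product is evaluated in default integer arithmetic and typically wraps
  ! around, so the value that would be handed to allocate is nonsense
  ! (often negative).
  n_bad = n_sparse * n_data

  ! Promoting one operand first forces the multiplication into 64-bit arithmetic.
  n_good = int(n_sparse, int64) * int(n_data, int64)

  write(*,*) 'size computed in default integers:', n_bad
  write(*,*) 'size computed in 64-bit integers :', n_good

  ! allocate itself accepts 64-bit bounds, e.g. allocate(x(n_good)), so only
  ! the size arithmetic needs to change, not the array declaration.
end program size_overflow_sketch
```

With gfortran, building with -fdefault-integer-8 is sometimes used as a blunter workaround, though it can conflict with libraries such as BLAS/LAPACK or MPI that expect default-width integers; promoting the specific size expressions is the more targeted change.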
Hi all,
I've been having issues trying to utilise the MPI version of GAP_fit. I compiled with the latest version of QUIP, as per the instructions provided on GitHub (including the added steps for MPI). When I try to fit a potential, I get a segmentation fault during the calculation of the sparse covariance matrices:
Further to this, the reported total system memory doesn't seem to match what I would expect. For instance, when using 1 node with 256 GB of memory, the total system memory is reported as 256 GB. However, when running with e.g. 4 nodes, this number remains the same. I compiled on ARCHER2 with the existing architecture file for archer2+openmp+openmpi.
The details of my GAP installation are below:
My GAP input is as follows (I presume this is correct after the update to allow single run sparsification?):
Thank you in advance for any help!