[ERROR] Segfault in extract_forceconstants when built with -Ofast #122

ejmeitz · 2025-01-22T17:06:10Z

The error: a segmentation fault when running extract_forceconstants -rc2 4.0 with the optimization flag set to -Ofast. (I tried -Og to debug, and it works fine with that.)

The input data I attached has 5000 samples (yes, that's a lot, but I have 96 GB of RAM and the IFC calculation does not need that much). With N_timesteps set to 5000 in infile.meta, the segfault occurs somewhere in the symmetry calculation. The last output to screen was:

 GENERATING CONSTRAINTS
 ... 0 secondorder constraints (0.00027s)
 ... created 0 constraints (0.00029s)
... creating fc coefficients            100.0% |========================================|  4.58623s
 ... least squares solver
 ... solved second order (1.06431s)
 ... finished solving for forceconstants
 ... diagnostics from three-fold cross-validation:
Segmentation fault

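If I set N_timesteps to around 2500, the code runs a bit longer. Based on the screen output, it crashes somewhere after this line (https://github.com/tdep-developers/tdep/blob/23c403ee227fe92bd04f0f537ea74e2fae5b2e5f/src/extract_forceconstants/main.f90#L602). The output ended at: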

 ... energies writen to `outfile.energies`

 ENERGIES (meV/atom):
                     rms          std          std(res)
          input:     9.693095     0.551539     -
   second order:     9.693138     0.552289     0.016234
Segmentation fault

I'm on git commit 23c403e. I set the optimization level to -Og to try to debug, and the error went away, so I can't give much more information. It does seem related to the number of samples somehow. Yes, this is WAY more samples than TDEP is probably intended to use, but I thought I'd report it since it's unclear to me what's happening.

input_data.zip
position_data.zip
It won't let me upload infile.forces for some reason...


mjv500 · 2025-01-22T17:41:16Z

Hi Ethan, thanks for the heads-up and for flagging this. I agree this is not the normal use case, and it's strange that it's not a memory issue. Perhaps you can compile with -Ofast -g -check all -warn all -diag-enable remark and see what happens? Best, Matthieu


-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Professor Matthieu J Verstraete Fellow, American Physical Society Chair, Steering Committee, European Theoretical Spectroscopy Facility www.etsf.eu Alumnus Fellow, Young Academy of Europe yacadeuro.org Institute for Theoretical Physics (ITP) Department of Physics Buys Ballot Gebouw/Building, Princetonplein 5, office University of Utrecht, 3584 CC Utrecht ITP Secretariat: +31 30 253 5928 E-mail: ***@***.*** Group web page (Liège): http://www.nanomat.ulg.ac.be/ Nanomat lab, Q-Mat center, Université de Liège Département de Physique, Bat. B5a, 4/50 Allée du 6 août, 19 B-4000 Sart Tilman, Liège Belgium Phone : +32 4 366 90 17 European Theoretical Spectroscopy Facility (ETSF) Mail : ***@***.*** ***@***.*** ***@***.***

ejmeitz · 2025-01-22T21:17:15Z

OK, I ran gdb with those flags on, and I can now get a line number for both cases. I don't have polar interactions in this simulation, so the fp array *should* be empty.

With N = 2500, it segfaults at:

Thread 1 "extract_forceco" received signal SIGSEGV, Segmentation fault.
0x000000000040fc24 in extract_forceconstants () at main.f90:611
611 sfp = lo_stddev(f0 - fp)*lo_force_hartreebohr_to_eVa

With N = 5000, it segfaults at:

Thread 1 "extract_forceco" received signal SIGSEGV, Segmentation fault.
0x0000000000534554 in ifc_solvers.ifc_solvers_diagnostics::report_diagnostics (map=..., tp=..., ih=..., dh=..., mw=..., mem=..., verbosity=<optimized out>) at ifc_solvers_diagnostics.f90:110
110 force_rsq(div, 2) = distributed_rsquare(f2 + fp, f0, nrow, mw)

mjv500 · 2025-01-23T08:45:42Z

Hi Ethan, fp is empty but allocated, so it should fail at first use if memory is insufficient or something like that. You could try resetting it to 0 just before these lines, but the fact that the segfault moves around with N is suspicious; it could be caused by a completely different line or array. What is the lowest value of N that fails? We might at least put a warning in the code that TDEP is not designed for that kind of scaling (though there is no excuse for a segfault). Have you tried valgrind as well?

ejmeitz · 2025-01-23T14:56:33Z

I monitored the memory usage with htop while it was running. At most about 10 GB is used, and I have 96 GB available, so it really shouldn't be an out-of-memory issue. Some memory is definitely being accessed that shouldn't be. Arrays like f0, f1, ..., fp should just be N x 3 arrays, i.e. 5000 x 3 Float32 values, which is tiny.
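As a rough check of the "tiny" estimate, the footprint of one dense array is simply samples x components x bytes per value. This is a minimal sketch that assumes the N x 3 shape stated in the comment; the helper name array_bytes is hypothetical, and the 8-byte (double precision) case is only an assumption, since this thread does not state the element type TDEP actually uses. If the arrays also carry an atom index, the result scales by the number of atoms.

```python
def array_bytes(n_samples: int, n_components: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes for one dense n_samples x n_components array (hypothetical helper)."""
    return n_samples * n_components * bytes_per_value


if __name__ == "__main__":
    # Compare the two N_timesteps values from this issue, for 4- and 8-byte reals.
    for n in (2500, 5000):
        for label, size in (("float32", 4), ("float64", 8)):
            b = array_bytes(n, bytes_per_value=size)
            print(f"N={n:5d} {label}: {b:7d} bytes ({b / 1024:.1f} KiB)")
```

For N = 5000 this gives 60,000 bytes (about 58.6 KiB) at 4 bytes per value and 120,000 bytes at 8 bytes per value, many orders of magnitude below the roughly 10 GB peak seen in htop.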

I just ran valgrind; the log is attached. I've never used valgrind before, but the output does report several invalid writes in the function that crashes under gdb. Happy to run more tests if you have ideas on how to fix the bug based on the valgrind log.

valgrind_output.log

ejmeitz added the error label Jan 22, 2025
