Add some more error recovery code in CIVET #164

Open
prioux opened this issue Oct 21, 2020 · 0 comments

prioux commented Oct 21, 2020

There are circumstances on a cluster where an interrupted CIVET task leaves the pipeline's files in an inconsistent state. After a restart, this shows up at the end of the main processing log as a message that looks like:

A conflict has been found with the specification of the 
inputs and the prereqs in this stage:
======= 1402039_ses2: 29: verify_convergence ========
Inputs: [files files files]
Outputs: [files files]
Args: [blah blah]
Prereqs: mid_surface_left mid_surface_right
Status: not processed

All prereqs are finished but input /scratch/cbrain01/BourreauGrahamPlatform/GridShare/145/40/88/admin_prioux-Civet-T1454088/civet_out/1402039_ses2/logs/1402039_ses2.gray_surface_left.log does not exist.

The critical indication is the last line. For each stage in the pipeline, both a .finished file and a .log file must exist in the logs/ subdirectory. When CIVET is badly interrupted, only one of the two exists and the pipeline cannot recover.

The fix is to go to the logs subdirectory and make sure the files are paired: for each .finished file there must be a matching .log file. Listing by timestamp usually shows the problem clearly as two zero-size files side by side:

> ls -ltr | tail
-rw-r--r-- 1 cbrain01 rpp-aevans-ab  28383 Oct 18 02:44 1402039_ses2.laplace_field.log
-rw-r--r-- 1 cbrain01 rpp-aevans-ab      0 Oct 18 02:44 1402039_ses2.laplace_field.finished
-rw-r--r-- 1 cbrain01 rpp-aevans-ab      0 Oct 18 04:03 1402039_ses2.gray_surface_left.finished
-rw-r--r-- 1 cbrain01 rpp-aevans-ab 543487 Oct 18 05:22 1402039_ses2.gray_surface_right.log
-rw-r--r-- 1 cbrain01 rpp-aevans-ab      0 Oct 18 05:22 1402039_ses2.gray_surface_right.finished
-rw-r--r-- 1 cbrain01 rpp-aevans-ab   1010 Oct 18 05:22 1402039_ses2.mid_surface_left.log
-rw-r--r-- 1 cbrain01 rpp-aevans-ab      0 Oct 18 05:22 1402039_ses2.mid_surface_left.finished
-rw-r--r-- 1 cbrain01 rpp-aevans-ab   1015 Oct 18 05:22 1402039_ses2.mid_surface_right.log
-rw-r--r-- 1 cbrain01 rpp-aevans-ab      0 Oct 18 05:22 1402039_ses2.mid_surface_right.finished
-rw-r--r-- 1 cbrain01 rpp-aevans-ab   1597 Oct 21 09:25 1402039_ses2.options

So the cleanup step in this instance is to remove the file `1402039_ses2.gray_surface_left.finished`.

This cleanup could be automated in the recovery code of the CBRAIN CIVET integration.
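
As a rough illustration of the pairing check described above, here is a minimal shell sketch. It is not the CBRAIN recovery code: the function name `list_orphan_finished` is hypothetical. The sketch assumes that the only case to handle is a .finished file with no matching .log, as in the listing above. It only prints the candidates and deletes nothing, so the result can be reviewed first.

```shell
# list_orphan_finished LOGS_DIR
# Hypothetical helper (not part of CIVET or CBRAIN).
# Prints every *.finished file in LOGS_DIR that has no matching *.log file.
list_orphan_finished() {
    logs_dir=$1
    for finished in "$logs_dir"/*.finished; do
        # When no .finished file exists the glob stays unexpanded; skip it.
        [ -e "$finished" ] || continue
        # Same path with the .finished suffix replaced by .log
        log="${finished%.finished}.log"
        [ -e "$log" ] || printf '%s\n' "$finished"
    done
}

# Example use from the task's work directory (path shortened for illustration):
#   list_orphan_finished civet_out/1402039_ses2/logs
```

For the listing shown earlier, this would be expected to report only the gray_surface_left .finished file, which can then be removed by hand before restarting the task.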

@prioux prioux self-assigned this Oct 21, 2020