-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get poll to return task failure if job/log has been removed. #6577
Get poll to return task failure if job/log has been removed. #6577
Conversation
tests/unit/test_job_runner_mgr.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only test__job_poll_status_files_deleted_logdir
is directly related to the PR. Other tests should increase coverage. 😄
2661a36
to
01c7cb8
Compare
added unit tests for JobRunnerMgr._jobs_poll_status_files test the task_job_mgr end
01c7cb8
to
31fd08f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems. |
Co-authored-by: Oliver Sanders <[email protected]>
54afde3
to
0c110b3
Compare
adeb3c0
to
b14c460
Compare
Co-authored-by: Ronnie Dutta <[email protected]>
Co-authored-by: Ronnie Dutta <[email protected]>
From the original issue:
This does not seem to be true. It is only true if the job log dir and the contact file is removed.
The job can succeed even if its job log dir is removed from under its feet. There seems to be a problem here where the job log retrieval process keeps retrying, preventing shutdown without the [runtime]
[[task]]
script = """
rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
"""
platform = <remote PBS>
execution time limit = PT1M
[[[directives]]]
-q = shared
-l ncpus = 1
-l mem = 100mb |
Hmmm, we've seen this with PBS several times. If you delete to job dir for a submitted task, it cannot run (no job script), so that'll definitely do it. For a running job, not so sure.
Possibly. But it will likely fail due to IO error and we cannot guarantee that Cylc will be informed of the job's outcome. ... From testing, it looks like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Code makes sense.
- Tested against the example in the OP, works well, error message clearly logged.
We should talk this one over in tomorrow's meeting.
Another example:
This task "fails successfully" (this PR as it currently stands makes no difference) but then the workflow hangs on shutdown after Edit: the length of the hang depends on the |
Discussed today:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some suggestions at wxtim#72
Simplify poll handling of prematurely deleted job log dir
I'm happy with your suggestions. You wield your scapel with a confident hand. I like it. |
…n_cylc_show * upstream/master: (27 commits) tui: open log files in external application (cylc#6611) fix typo in stop.py - options.max_poll -> options.max_polls schema: add first-parent descendants (cylc#6610) doc: improve NamespaceIDGlob description (cylc#6637) Document that "ex" prefix means "exclude" [skip ci] tests: remove string templating in SQL statements (cylc#6631) Merge pull request cylc#6629 from wxtim/tests.flake8 Remove vestiges of authorisation layer removed in cylc#3845 Bump dev version Prepare release 8.4.1 Merge pull request cylc#6578 from MetRonnie/graphql-err-handling Add test Update cylc/flow/etc/examples/expiry/index.rst Correct type annotations Fix duplicate task triggers Add test for duplicate task triggers Pytest: full verbosity for assertions Fix test picking up user global config Wrapper script: fix `PATH` override preventing selection of Cylc version in GUI under Cylc Hub Get poll to return task failure if job/log has been removed. (cylc#6577) ...
Closes #6425
Note
Note to reviewers, you will need to deploy this branch onto remote platforms to confirm it works for remote filesystems.
Check List
CONTRIBUTING.md
and added my name as a Code Contributor.setup.cfg
(andconda-environment.yml
if present).?.?.x
branch.