fix: Handle unresponsive sacct #5

Merged
merged 3 commits into from
Nov 23, 2023

fgvieira
Contributor

@fgvieira fgvieira commented Nov 15, 2023

Fix snakemake/snakemake#2411 (reposting PR snakemake/snakemake#2413 on new repo)

When sacct is unresponsive (and the query times out), Snakemake currently exits with an error. This PR handles the timeout by retrying the query. I am not sure whether it should wait a bit longer before querying sacct again.
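
As a rough illustration of the "try again on timeout" idea, here is a minimal sketch. It is not the code from this PR: the helper name `query_with_retry` and its `attempts`, `timeout`, `wait` and `run` parameters are assumptions made up for the example. Only `subprocess.run` and its exceptions are real standard-library APIs.

```python
import subprocess
import time


def query_with_retry(command, attempts=3, timeout=30.0, wait=0.0, run=subprocess.run):
    """Run `command`, retrying on timeout or non-zero exit status.

    Hypothetical helper: returns the command's stdout as a string,
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            completed = run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout,  # raises TimeoutExpired if exceeded
                check=True,  # raises CalledProcessError on non-zero exit
            )
            return completed.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as err:
            print(f"status query failed (attempt {attempt}/{attempts}): {err}")
            # Optionally pause before the next attempt, e.g. while the
            # accounting database is temporarily unavailable.
            if attempt < attempts and wait > 0:
                time.sleep(wait)
    return None
```

Returning `None` instead of raising lets the caller treat "no answer yet" as a state to poll again later, rather than as a fatal error.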

EDIT: some more info

The job status query failed with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 05969656-0e62-47f1-9008-2a189069f0a7
Error message: sacct: error: get_addr_info: getaddrinfo() failed: Name or service not known
sacct: error: slurm_set_addr: Unable to resolve "db01fl"
sacct: error: slurm_get_port: Address family '0' not supported
sacct: error: Error connecting, bad data: family = 0, port = 0
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:db01fl:6819: Resource temporarily unavailable
sacct: error: Sending PersistInit msg: Resource temporarily unavailable
sacct: error: Problem talking to the database: Resource temporarily unavailable

Traceback (most recent call last):
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/__init__.py", line 886, in _wait_thread
    asyncio.run(self._wait_for_jobs())
  File "/envs/snakemake_env/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/slurm/slurm_submit.py", line 399, in _wait_for_jobs
    (status_of_jobs, sacct_query_duration) = await self.job_stati(
                                             ^^^^^^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/slurm/slurm_submit.py", line 330, in job_stati
    return (res, query_duration)
            ^^^
UnboundLocalError: cannot access local variable 'res' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
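
The `UnboundLocalError` at the end of the traceback is the typical symptom of a variable that is only assigned inside a `try` block whose failure path is handled without assigning it. The sketch below reproduces that pattern in isolation. It is a hypothetical reduction, not the actual `job_stati` code, and the function names and the use of `RuntimeError` are assumptions for illustration.

```python
def parse_status_lines(output):
    """Parse `JobIdRaw|State` lines (the --parsable2 --noheader layout)."""
    statuses = {}
    for line in output.strip().splitlines():
        job_id, state = line.split("|", 1)
        statuses[job_id] = state
    return statuses


def job_statuses_buggy(query):
    try:
        res = parse_status_lines(query())
    except RuntimeError:
        pass  # failure is swallowed, but `res` was never assigned
    return res  # raises UnboundLocalError if query() failed


def job_statuses_fixed(query):
    res = None  # defined on every code path
    try:
        res = parse_status_lines(query())
    except RuntimeError:
        pass  # caller sees None and can query again later
    return res
```

With the fixed variant, a failed query yields `None`, which the caller can interpret as "status unknown, try again" instead of crashing.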

@fgvieira
Contributor Author

Not sure if there is a wait time between two sacct queries, but it might be a good idea (especially if the DB is temporarily unavailable).

@johanneskoester
Contributor

> Not sure if there is a wait time between two sacct queries, but it might be a good idea (especially if the DB is temporarily unavailable).

There is one via the rate limiter. Maybe that is sufficient for now.
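
To make the "wait via a rate limiter" idea concrete, here is a minimal sketch of a limiter that enforces a minimum interval between consecutive status queries. It is only an illustration: the class `MinIntervalLimiter` and its parameters are hypothetical and do not describe the rate limiter Snakemake actually uses.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least `interval` seconds apart.

    Hypothetical example class; `clock` and `sleep` are injectable so the
    behaviour can be checked without real waiting.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` before each status query spaces the queries out, so a retry after a failed query is automatically delayed by at least the configured interval.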

@johanneskoester johanneskoester merged commit 2f7ec1b into snakemake:main Nov 23, 2023
4 checks passed
@fgvieira fgvieira deleted the sacct_fail branch November 23, 2023 12:19
johanneskoester pushed a commit that referenced this pull request Dec 6, 2023
🤖 I have created a release *beep* *boop*
---


## [0.1.3](v0.1.2...v0.1.3) (2023-12-06)


### Bug Fixes

* Handle unresponsive sacct ([#5](#5)) ([2f7ec1b](2f7ec1b))


### Documentation

* update author encoding ([890bdb0](890bdb0))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
dlaehnemann added a commit to dlaehnemann/snakemake-executor-plugin-lsf that referenced this pull request Jan 10, 2025
This solution is adapted from the same issue in snakemake-executor-plugin-slurm:
snakemake/snakemake-executor-plugin-slurm#5
Development

Successfully merging this pull request may close these issues.

SACCT Error.
2 participants