Consider AAA flags when adding input data to WMBS #12212

Open
wants to merge 1 commit into
base: master

amaltaro
Contributor

Fixes #11501

Status

not-tested

Description

To prevent workqueue elements from failing to be acquired from LQE to WMBS in WorkQueueManager, fall back the data location to the output of possibleSites() from https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataStructs/WorkQueueElement.py#L18 whenever a Rucio rule is not fully satisfied and AAA has been enabled for the workflow.

Without this change, the LQE will keep failing to be acquired by WMBS until the data converges to the destination (likely only after someone notices the problem, discusses it with the DM team, and action is taken).

This change has one potential drawback, though: if the data is not available anywhere on Disk, jobs will likely fail to read the input data. Hopefully the data will converge before job retries are exhausted. Otherwise, ACDC documents will be created reporting the expected location of that file as its data location (usually a logical OR of RSEs).
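As a rough illustration of the fallback described above, here is a minimal sketch. It is not the code in this PR: the function name, its parameters, and the site names are all hypothetical, and `possible_sites` simply stands in for whatever possibleSites() returns for the element.

```python
def resolve_block_locations(rucio_locations, rule_satisfied,
                            trust_sitelists, possible_sites):
    """Pick the locations to record for an input block (hypothetical helper).

    rucio_locations: sites where Rucio reports the block on Disk
    rule_satisfied:  True if the Rucio rule is fully satisfied
    trust_sitelists: True if the input AAA flag is enabled for the workflow
    possible_sites:  stand-in for the output of possibleSites()
    """
    if rule_satisfied:
        # Normal case: trust what Rucio reports.
        return sorted(rucio_locations)
    if trust_sitelists:
        # Proposed fallback: rule not satisfied, but AAA is enabled,
        # so use the element's possible sites instead of failing.
        return sorted(possible_sites)
    # AAA disabled and rule not satisfied: no usable location yet,
    # so the element cannot be acquired into WMBS.
    return []


# Illustrative site names only.
print(resolve_block_locations({"T1_US_FNAL"}, False, True,
                              {"T2_CH_CERN", "T1_US_FNAL"}))
```

With AAA disabled, the sketch still returns no location for an unsatisfied rule, so that case behaves as it does today.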

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@amaltaro
Contributor Author

I am still unsure whether this should be merged at some point, as there might be undesired behavior in some specific scenarios. For instance:

  • Data processing with the input AAA flag disabled (called TrustSitelists in the spec) is not affected, whether or not the workflow is configured to start with partial data availability.
  • However, data processing with the input AAA flag enabled would be affected: the LQE would be pulled by the agent as soon as priority and resources are available, potentially processing data not yet available on Disk. By contrast, the current behavior hits that exception and does not let data be persisted in WMBS as long as the full block (or container) is not available in at least 1 Disk endpoint.

In any case, I want to apply this to submit10, as discussed in #12210.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 2 warnings
    • 38 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/236/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

@hassan11196 I wonder if you managed to take a look at this? If you have any questions or suggestions, please let me know.

Successfully merging this pull request may close these issues.

Workflow assigned with TrustSitelists=True can fail at WorkQueueManager if files have not replicated