This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

Unified modules and their roles

Hasan Öztürk edited this page Apr 8, 2021 · 1 revision

Checkor

The Checkor module checks workflows that are in completed status in ReqMgr2.

  1. completed to closed-out transition:

    The module calculates the expected and observed statistics for each output of the workflow, in terms of lumisections. If the observed statistics, as a fraction of the expected ones, are greater than or equal to the threshold (fractionpass), the workflow is moved to closed-out status in ReqMgr2. This means that the workflow produced satisfactory results for all of its outputs. Otherwise, the workflow is labeled with an assistance tag, meaning that it requires manual intervention to address its issues. The fractionpass is 100% by default, but it can be overridden at the campaign level. For instance, most MC workflows have a fractionpass of 95%. The module also contains some extra logic that may lower the fractionpass if certain criteria are met.
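
    The transition rule above can be sketched as follows. This is a minimal illustration, not the Checkor implementation: the function names, the shape of the `outputs` mapping and the handling of empty or zero-expectation inputs are all assumptions, and the extra logic that may lower the fractionpass is not modeled.

```python
def passes_fraction(expected_lumis: int, observed_lumis: int,
                    fractionpass: float = 1.0) -> bool:
    """Return True when observed/expected lumisections meets the threshold."""
    if expected_lumis <= 0:
        # Assumption: with nothing expected there is nothing to compare against,
        # so the output is treated as not passing.
        return False
    return observed_lumis / expected_lumis >= fractionpass


def next_state(outputs: dict, fractionpass: float = 1.0) -> str:
    """Decide the outcome for a completed workflow.

    `outputs` is a hypothetical mapping: dataset name -> (expected, observed).
    """
    if not outputs:
        # Assumption: a workflow without outputs needs a human to look at it.
        return "assistance"
    all_ok = all(
        passes_fraction(expected, observed, fractionpass)
        for expected, observed in outputs.values()
    )
    return "closed-out" if all_ok else "assistance"


# Default threshold is 100%; a campaign may override it, e.g. 95% for MC.
print(next_state({"output_a": (100, 96)}))        # default fractionpass
print(next_state({"output_a": (100, 96)}, 0.95))  # campaign-level override
```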

  2. Assistance labeling:

    As mentioned above, if the workflow did not reach satisfactory results, it stays in completed status and is labeled with one or more assistance tags. These tags show which kind of issue the workflow has and which level of resubmission (ACDC) it is at.

  3. Output lumisection size check:

    Lumisections that are too small or too big are both problematic, so the module checks for both. The lower limit is set in the Unified configuration file. If the events/lumi of an output is lower than this value, the workflow is tagged with the assistance-smalllumi label and a human checks the workflow.

    The upper limit is set at the campaign level. If lumi_size is -1, there is no upper limit. If the events/lumi is greater than the upper limit, the workflow is tagged with assistance-biglumi and a human checks the workflow.
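
    A minimal sketch of the two-sided check described above. The function and the `min_events_per_lumi` parameter name are hypothetical; only `lumi_size`, its -1 convention and the two tag names come from this page.

```python
from typing import Optional


def lumi_size_tag(events: int, lumis: int,
                  min_events_per_lumi: float,
                  lumi_size: float = -1) -> Optional[str]:
    """Return the assistance tag for an output, or None if its size is fine.

    min_events_per_lumi: lower limit (from the Unified configuration file).
    lumi_size: upper limit (from the campaign); -1 means no upper limit.
    """
    if lumis <= 0:
        # Assumption: without lumisections there is no ratio to check.
        return None
    events_per_lumi = events / lumis
    if events_per_lumi < min_events_per_lumi:
        return "assistance-smalllumi"
    if lumi_size != -1 and events_per_lumi > lumi_size:
        return "assistance-biglumi"
    return None


print(lumi_size_tag(events=50, lumis=10, min_events_per_lumi=10, lumi_size=1000))
print(lumi_size_tag(events=50000, lumis=10, min_events_per_lumi=10, lumi_size=1000))
print(lumi_size_tag(events=50000, lumis=10, min_events_per_lumi=10))
```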

  4. Filemismatch check:

    For each output dataset, the module checks whether the number of files in DBS matches the number of files in Rucio. If they do not match, the workflow is tagged with the assistance-filemismatch label and a human checks the workflow.

    Note that in WMAgent there is a delay between file injection into DBS and into Rucio, which causes a temporary filemismatch. In this scenario, the workflow is tagged with the assistance-agentfilemismatch label. If the filemismatch is not resolved within 2 days, the workflow is moved to assistance-filemismatch.
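
    The escalation from the temporary tag to the permanent one can be sketched as below. This is a simplified illustration under stated assumptions: the file counts are passed in as plain integers rather than queried from DBS and Rucio, the `first_seen` timestamp is a hypothetical bookkeeping value, and every mismatch is treated as a possible injection delay during the first 2 days.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 2-day window comes from this page; the constant name is an assumption.
GRACE_PERIOD = timedelta(days=2)


def filemismatch_tag(dbs_files: int, rucio_files: int,
                     first_seen: datetime, now: datetime) -> Optional[str]:
    """Return the filemismatch tag for one output dataset, or None.

    first_seen: when the mismatch was first observed (hypothetical bookkeeping).
    """
    if dbs_files == rucio_files:
        return None
    if now - first_seen < GRACE_PERIOD:
        # Still plausibly caused by the DBS/Rucio injection delay in the agent.
        return "assistance-agentfilemismatch"
    return "assistance-filemismatch"


seen = datetime(2021, 4, 1, 12, 0)
print(filemismatch_tag(120, 118, seen, seen + timedelta(hours=6)))
print(filemismatch_tag(120, 118, seen, seen + timedelta(days=3)))
```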

  5. [CURRENTLY DISABLED] Duplicate check:

    For each output, the module queries DBS and checks for duplicate events. If duplicate events are found, it invalidates the file(s) causing the duplication.

    Since this is a very expensive and heavy operation, this feature is currently disabled.

  6. Invalid file(s) check:

    If the number of invalid files in the output is above a threshold, then the workflow is tagged with assistance-invalidfiles and a human checks the workflow.
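
    As a rough sketch, the check amounts to counting invalid files and comparing the count against a limit. The record layout (`is_valid` key) and the `max_invalid` parameter are hypothetical; this page does not state the actual threshold or where it is configured.

```python
from typing import Iterable, Mapping, Optional


def invalid_files_tag(files: Iterable[Mapping], max_invalid: int) -> Optional[str]:
    """Return the assistance tag when too many files of an output are invalid.

    files: hypothetical file records, each carrying a boolean "is_valid" flag.
    max_invalid: hypothetical threshold; the tag is set only above it.
    """
    invalid = sum(1 for f in files if not f["is_valid"])
    return "assistance-invalidfiles" if invalid > max_invalid else None


records = [{"is_valid": True}, {"is_valid": False}, {"is_valid": False}]
print(invalid_files_tag(records, max_invalid=1))
```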

  7. Create/update JIRA ticket:

    Based on the checks done within the module, a JIRA ticket is created/updated automatically.

  8. Create a lumisection summary webpage:

    A webpage that shows the lumisections is created, e.g. https://cms-unified.web.cern.ch/cms-unified/datalumi/lumi.ReReco-Run2017C-JetHT-UL2017_MiniAODv1_NanoAODv2_pilot4-00001.html

  9. Create notifications for the requestors:

    The requestors are notified about the issues that the workflow is having.
