WMAgent: SiteLists updates does not propagate to local workqueue elements #12244

Open
todor-ivanov opened this issue Jan 29, 2025 · 1 comment · May be fixed by #12245

@todor-ivanov

todor-ivanov commented Jan 29, 2025

Impact of the bug
WMAgent

Describe the bug
While testing the propagation of SiteLists changes down to the local workqueue elements at the WMAgent, together with @mapellidario, we found that a workflow parameter update in WMStats, and the corresponding update of the global WorkQueue elements, triggered no change in the local WorkQueue. Our initial investigation found several reasons for this:

  • A few typos in the workflow parameter names
  • A too narrow mask of Local WorkQueue element statuses considered for update
  • A broken mechanism for fetching the CouchDB URL of the workflow spec
  • The workload object for the local workqueue was loaded from the spec instead of localcouchdb, which
  • The workload.specUrl() method was not agnostic to the source from which the workload object had been created, which resulted in an exception of the type shown in [1] while trying to persist the changes in local couch
  • Even after fixing how we instantiate the workload object and calling the saveCouch method properly, we were still facing an Unauthorised error (see [2]), because the spec URL returned by the method above was sanitized during the LocalWorkQueue object creation and the username and password were removed from the URL
  • We found a few redundant calls to the workload setter methods upon updating the workqueue elements; this is already done sequentially through the procedures for updating the workqueue elements, as widely discussed in the issue and the implementation for the GlobalWorkQueue elements update.
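To illustrate the status-mask point from the list above, here is a minimal sketch of how a narrow mask can silently skip elements. It is not WMCore code: `select_for_update`, the element dictionaries, and the wider mask are hypothetical and chosen only for illustration. The only value taken from this issue is the `status=['Available']` argument visible in the tracebacks below; the set of statuses used by the actual fix is not stated here.

```python
def select_for_update(elements, statuses):
    """Return the elements whose 'Status' is in the given mask (hypothetical helper)."""
    allowed = set(statuses)
    return [elem for elem in elements if elem.get("Status") in allowed]


# Hypothetical local workqueue elements in different states.
elements = [
    {"id": "a", "Status": "Available"},
    {"id": "b", "Status": "Acquired"},
    {"id": "c", "Status": "Running"},
]

# The mask seen in the traceback: only 'Available' elements are considered.
narrow = select_for_update(elements, ["Available"])

# An illustrative wider mask (an assumption, not the mask chosen in the fix).
wider = select_for_update(elements, ["Available", "Negotiating", "Acquired"])

print([elem["id"] for elem in narrow])
print([elem["id"] for elem in wider])
```

With the narrow mask, any element that has already left the `Available` state never receives the new site lists, even though the update call itself succeeds.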

How to reproduce it
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.

Additional context and error message

These issues were found while validating the dynamic SiteWhitelist/SiteBlacklist change in view of the upcoming central services and WMAgent release candidates #12222 #12224. It is a follow-up to the PR #12123


[1]

In [1]: sitelistpoller.algorithm()
2025-01-28 12:50:59,981:INFO:SiteListPoller:algorithm(): Active workflows: dict_keys(['dmapelli_SC_EL8_JSON_Nvidia_test_v1_250124_095025_2857', 'dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803', 'dmapelli_TaskChain_ProdMinBias_Nvidia_test_v1_250124_095038_1012'])
2025-01-28 12:50:59,982:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'running-closed'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 12:51:00,075:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'running-open'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 12:51:00,162:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'acquired'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 12:51:00,248:INFO:SiteListPoller:algorithm(): 
wdict: {'dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803': {'SiteBlacklist': ['T2_AT_Vienna',
                                                                                       'T2_BE_IIHE'],
                                                                     'SiteWhitelist': ['T1_US_FNAL',
                                                                                       'T2_CH_CERN']},
 'dmapelli_SC_EL8_JSON_Nvidia_test_v1_250124_095025_2857': {'SiteBlacklist': [],
                                                            'SiteWhitelist': ['T1_US_FNAL',
                                                                              'T2_CH_CERN']},
 'dmapelli_TaskChain_ProdMinBias_Nvidia_test_v1_250124_095038_1012': {'SiteBlacklist': [],
                                                                      'SiteWhitelist': ['T1_US_FNAL',
                                                                                        'T2_CH_CERN']}}

2025-01-28 12:51:00,286:INFO:SiteListPoller:algorithm(): Updating dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803:
2025-01-28 12:51:00,286:INFO:SiteListPoller:algorithm():   siteWhitelist ['T1_US_FNAL', 'T2_CH_CERN'] => ['T1_US_FNAL', 'T2_CH_CERN']
2025-01-28 12:51:00,286:INFO:SiteListPoller:algorithm():   siteBlacklist [] => ['T2_AT_Vienna', 'T2_BE_IIHE']
2025-01-28 12:51:00,295:ERROR:SiteListPoller:algorithm(): Unexpected exception while updating elements in local workqueue Details:
You must include http(s):// in your servers address
Traceback (most recent call last):
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMComponent/WorkflowUpdater/SiteListPoller.py", line 133, in algorithm
    self.localWQ.updateElementsByWorkflow(wHelper, params, status=['Available'])
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/WorkQueue/WorkQueue.py", line 290, in updateElementsByWorkflow
    workload.saveCouchUrl(workload.specUrl())
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/Persistency.py", line 124, in saveCouchUrl
    return self.saveCouch(couchUrl, dbname)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/Persistency.py", line 84, in saveCouch
    server = CouchServer(couchUrl)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 967, in __init__
    check_server_url(dburl)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 46, in check_server_url
    raise ValueError('You must include http(s):// in your servers address')
ValueError: You must include http(s):// in your servers address
Out[1]: (0.7039, None, 'algorithm')
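The `ValueError` in [1] is raised by a scheme check on the server address. The sketch below is a stand-in for that kind of check, not the `check_server_url` code from CMSCouch; `has_http_scheme` and both example values are hypothetical. It assumes the spec URL of a workload loaded from a file was a local path rather than a CouchDB URL, which is consistent with the bullet list above but not shown explicitly in the log.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """Return True if the URL starts with an http or https scheme (stand-in check)."""
    return urlsplit(url).scheme in ("http", "https")


# Hypothetical values: a spec location on disk vs. a spec stored in CouchDB.
spec_from_file = "/some/local/path/WMSandbox/WMWorkload.pkl"
spec_from_couch = "http://127.0.0.1:5984/some_db/some_workflow/spec"

for candidate in (spec_from_file, spec_from_couch):
    if has_http_scheme(candidate):
        print("usable as a CouchDB server address:", candidate)
    else:
        print("would be rejected, no http(s):// scheme:", candidate)
```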

[2]

In [1]: sitelistpoller.algorithm()
2025-01-28 17:04:16,958:INFO:SiteListPoller:algorithm(): Active workflows: dict_keys(['dmapelli_SC_EL8_JSON_Nvidia_test_v1_250124_095025_2857', 'dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803', 'dmapelli_TaskChain_ProdMinBias_Nvidia_test_v1_250124_095038_1012'])
2025-01-28 17:04:16,958:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'running-closed'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 17:04:17,054:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'running-open'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 17:04:17,142:INFO:SiteListPoller:wmstatsDict(): Fetch site info from WMStats for condition: {'RequestStatus': 'acquired'} and mask ['SiteWhitelist', 'SiteBlacklist']
2025-01-28 17:04:17,229:INFO:SiteListPoller:algorithm(): 
wdict: {'dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803': {'SiteBlacklist': ['T1_IT_CNAF',
                                                                                       'T1_RU_JINR',
                                                                                       'T1_UK_RAL'],
                                                                     'SiteWhitelist': ['T1_DE_KIT']},
 'dmapelli_SC_EL8_JSON_Nvidia_test_v1_250124_095025_2857': {'SiteBlacklist': [],
                                                            'SiteWhitelist': ['T1_US_FNAL',
                                                                              'T2_CH_CERN']},
 'dmapelli_TaskChain_ProdMinBias_Nvidia_test_v1_250124_095038_1012': {'SiteBlacklist': [],
                                                                      'SiteWhitelist': ['T1_US_FNAL',
                                                                                        'T2_CH_CERN']}}

2025-01-28 17:04:17,270:INFO:SiteListPoller:algorithm(): Updating dmapelli_ReReco_RunBlockWhite_Nvidia_test_v1_250124_095017_1803:
2025-01-28 17:04:17,270:INFO:SiteListPoller:algorithm():   siteWhitelist ['T1_US_FNAL', 'T2_CH_CERN'] => ['T1_DE_KIT']
2025-01-28 17:04:17,270:INFO:SiteListPoller:algorithm():   siteBlacklist ['T2_BE_IIHE', 'T0_CH_CSCS_HPC', 'T2_AT_Vienna'] => ['T1_IT_CNAF', 'T1_RU_JINR', 'T1_UK_RAL']
2025-01-28 17:04:17,282:ERROR:SiteListPoller:algorithm(): Unexpected exception while updating elements in local workqueue Details:
Error type: CouchUnauthorisedError, Status code: 401, Reason: Unauthorized, Data: {}
Traceback (most recent call last):
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 133, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/Requests.py", line 185, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/Requests.py", line 202, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/WMAgent.venv3/srv/WMCore/src/python/Utils/PortForward.py", line 68, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/pycurl_manager.py", line 353, in request
    raise exc
http.client.HTTPException: url=http://127.0.0.1:5984/_all_dbs, code=401, reason=Unauthorized, headers={'Cache-Control': 'must-revalidate', 'Content-Length': '64', 'Content-Type': 'application/json', 'Date': 'Tue, 28 Jan 2025 16:04:17 GMT', 'Server': 'CouchDB/3.2.2 (Erlang OTP/23)', 'X-Couch-Request-ID': '8c9c8e9307', 'X-CouchDB-Body-Time': '0'}, result=b'{"error":"unauthorized","reason":"You are not a server admin."}\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMComponent/WorkflowUpdater/SiteListPoller.py", line 137, in algorithm
    self.localWQ.updateElementsByWorkflow(wHelper, params, status=['Available'])
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/WorkQueue/WorkQueue.py", line 291, in updateElementsByWorkflow
    workload.saveCouchUrl(workload.specUrl())
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/Persistency.py", line 124, in saveCouchUrl
    return self.saveCouch(couchUrl, dbname)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/Persistency.py", line 85, in saveCouch
    database = server.connectDatabase(couchDBName)
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 1013, in connectDatabase
    if create and dbname not in self.listDatabases():
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 982, in listDatabases
    return self.get('/_all_dbs')
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/Requests.py", line 146, in get
    return self.makeRequest(uri, data, 'GET', incoming_headers,
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 137, in makeRequest
    self.checkForCouchError(getattr(e, "status", None),
  File "/data/WMAgent.venv3/srv/WMCore/src/python/WMCore/Database/CMSCouch.py", line 153, in checkForCouchError
    raise CouchUnauthorisedError(reason, data, result, status)
WMCore.Database.CMSCouch.CouchUnauthorisedError: Error type: CouchUnauthorisedError, Status code: 401, Reason: Unauthorized, Data: {}
Out[1]: (0.6922, None, 'algorithm')
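The 401 in [2] is consistent with a URL whose credentials were stripped before it was reused for a write. Below is a minimal sketch of that effect using only the standard library. `strip_credentials`, the user name, the password, and the database name are hypothetical; this is not the sanitizing code used by LocalWorkQueue.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Split a URL into (url without userinfo, username, password)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return clean, parts.username, parts.password


# Hypothetical CouchDB URL carrying credentials.
full_url = "http://user:secret@127.0.0.1:5984/workqueue"
clean_url, username, password = strip_credentials(full_url)

print(clean_url)  # the URL that remains after sanitizing, without userinfo
print(username is not None and password is not None)
```

If only `clean_url` is kept and later handed to a save call, the request goes out unauthenticated. That would explain a rejection of `/_all_dbs` with "You are not a server admin", as seen in the log above.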

@amaltaro

@todor-ivanov @mapellidario I just noticed you left some placeholders in the initial description. Can you please update those as well?
If updating the issue template https://github.com/dmwm/WMCore/blob/master/.github/ISSUE_TEMPLATE/bug_report.md gives us a better experience and usability, please bring this up and we might change it.
