Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(ERROR) (ProcessorStage) Failed to process dataset #427

Open
witxka opened this issue Oct 8, 2020 · 1 comment
Open

(ERROR) (ProcessorStage) Failed to process dataset #427

witxka opened this issue Oct 8, 2020 · 1 comment

Comments

@witxka
Copy link
Collaborator

witxka commented Oct 8, 2020

root@aiatlas171:/var/log/dkb/data4es-hourly.log

(95) 2020-10-07 10:24:52 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 167, in process
(95) (==)     amiPhysValues(data)
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 207, in amiPhysValues
(95) (==)     raise Exception("Unexpected AMI response: %s" % json_str)
(95) (==) Exception: Unexpected AMI response: {u'AMIMessage': [{u'commandArgs': [{u'args': [{u'@argName': u'logicalDatasetName', u'@argValue': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@argName': u'scope', u'@argValue': u'mc16_13TeV'}]}], u'executionTime': [{u'$': u'0.018 '}], u'time': [{u'$': u'2020-10-07  at 10:24:52 AM CEST'}], u'connection': [{u'@amiLogin': u'dkb', u'@type': u'router', u'$': u'successful'}], u'command': [{u'$': u'AMIGetPhysicsParamsForDataset'}], u'dumpCommand': [{u'param': [{u'@name': u'command', u'$': u'AMIGetPhysicsParamsForDataset'}, {u'@name': u'logicalDatasetName', u'$': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@name': u'scope', u'$': u'mc16_13TeV'}]}], u'error': [{u'$': u"You don't have rights to execute this command."}]}]}
(95) 2020-10-07 10:24:52 (ERROR) (ProcessorStage) Failed to process dataset 'mc16_13TeV.451878.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_560_230.simul.HITS.e8248_e5984_s3126_tid22774608_00'
@mgolosova
Copy link
Collaborator

Related code: https://github.com/PanDAWMS/dkb/blob/master/Utils/Dataflow/data4es-nested/095_datasetInfoAMI/amiDatasets.py#L198-L201

This error does not interrupt the whole process, right? And since it's rare, it must be not about something we can fix on our side.

I suggest that we do about the following about this:

  1. Make it look a bit more accurate (not as a traceback):

    • in case of "unexpected response" (try to) extract 'error' response field and output it's value instead of a full traceback.
  2. Pause the process and retry a bit (30-60 seconds) later.
    Seems like in this case the error appeared in a short interval of time (2020-10-07 10:24:41-- 2020-10-07 10:24:52), so I believe it was due to restart of some service at the AMI server side.
    BTW, if we get the AMI server response, the pyAMI client doesn't try to query another instance, while in this case it might be our resque :(

  3. If the retry failed -- (properly) skip this message processing and go on:

    • mark the message as 'incomplete'.

The only problem here is that if the issue wasn't somehow fixed in a couple of minutes, and we're waiting for 30 seconds for every message passing through this stage -- the whole process will take very long time. Is it OK, or do we need some more elaborate scenario, like "if it fails N times in a row, stop retrying; just query AMI once for each new message and skip it if the problem's still here"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants