(ERROR) (ProcessorStage) Failed to process dataset #427

witxka · 2020-10-08T12:48:24Z

root@aiatlas171:/var/log/dkb/data4es-hourly.log

(95) 2020-10-07 10:24:52 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 167, in process
(95) (==)     amiPhysValues(data)
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 207, in amiPhysValues
(95) (==)     raise Exception("Unexpected AMI response: %s" % json_str)
(95) (==) Exception: Unexpected AMI response: {u'AMIMessage': [{u'commandArgs': [{u'args': [{u'@argName': u'logicalDatasetName', u'@argValue': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@argName': u'scope', u'@argValue': u'mc16_13TeV'}]}], u'executionTime': [{u'$': u'0.018 '}], u'time': [{u'$': u'2020-10-07  at 10:24:52 AM CEST'}], u'connection': [{u'@amiLogin': u'dkb', u'@type': u'router', u'$': u'successful'}], u'command': [{u'$': u'AMIGetPhysicsParamsForDataset'}], u'dumpCommand': [{u'param': [{u'@name': u'command', u'$': u'AMIGetPhysicsParamsForDataset'}, {u'@name': u'logicalDatasetName', u'$': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@name': u'scope', u'$': u'mc16_13TeV'}]}], u'error': [{u'$': u"You don't have rights to execute this command."}]}]}
(95) 2020-10-07 10:24:52 (ERROR) (ProcessorStage) Failed to process dataset 'mc16_13TeV.451878.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_560_230.simul.HITS.e8248_e5984_s3126_tid22774608_00'

The text was updated successfully, but these errors were encountered:

mgolosova · 2020-10-21T11:00:06Z

Related code: https://github.com/PanDAWMS/dkb/blob/master/Utils/Dataflow/data4es-nested/095_datasetInfoAMI/amiDatasets.py#L198-L201

This error does not interrupt the whole process, right? And since it's rare, it must be not about something we can fix on our side.

I suggest that we do about the following about this:

Make it look a bit more accurate (not as a traceback):
- in case of "unexpected response" (try to) extract 'error' response field and output it's value instead of a full traceback.
Pause the process and retry a bit (30-60 seconds) later.
Seems like in this case the error appeared in a short interval of time (2020-10-07 10:24:41-- 2020-10-07 10:24:52), so I believe it was due to restart of some service at the AMI server side.
BTW, if we get the AMI server response, the pyAMI client doesn't try to query another instance, while in this case it might be our resque :(
If the retry failed -- (properly) skip this message processing and go on:
- mark the message as 'incomplete'.

The only problem here is that if the issue wasn't somehow fixed in a couple of minutes, and we're waiting for 30 seconds for every message passing through this stage -- the whole process will take very long time. Is it OK, or do we need some more elaborate scenario, like "if it fails N times in a row, stop retrying; just query AMI once for each new message and skip it if the problem's still here"?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(ERROR) (ProcessorStage) Failed to process dataset #427

(ERROR) (ProcessorStage) Failed to process dataset #427

witxka commented Oct 8, 2020

mgolosova commented Oct 21, 2020

(ERROR) (ProcessorStage) Failed to process dataset #427

(ERROR) (ProcessorStage) Failed to process dataset #427

Comments

witxka commented Oct 8, 2020

mgolosova commented Oct 21, 2020