harvest exits on parse errors #25
Comments
Hi, can you maybe provide a sanitised traceback of one of the errors you're seeing? My first instinct is that the parse error is likely to be occurring somewhere in pyoai code, which oai-harvest uses for all the OAI-PMH protocol interactions. If so, it might be quite difficult to intercept, log, and output the raw data...
Yes, the problem is in pyoai.Client.parse code. I'm back patching that method via:
Which gives me, on error, for example:
Missing namespace prefix is another frequent XMLSyntaxError. I'm still testing, but that seems to work for me (except that if I use the '--dir' param, it seems to break on writing that raw file; without that param it's OK). But I don't expect many users to require this sort of workaround. Or maybe the sensible way to generalize it is just to have two separate modes to run:
I can sympathise - I spent over 10 years writing code to deal with what felt like an infinite variety of syntax errors in EAD files...
I've submitted a pull request upstream to pyoai. If that gets accepted, I would then submit a patch that builds on it to fix the issue in harvest. As I note in that pyoai pull request, trying to parse a second time on error is not needed. What is lost in making the code more general and correct is the assumption that I can know the particular identifier that fails. In the case of EAD from the feeds I've been testing, there is only one EAD record in each OAI ListRecords payload, but there are multiple oai_dc records, so I have to settle for just knowing which request failed, and not necessarily, in the general case, which id was malformed.
If the harvester hits a parse error on the metadata payload, it quits with an exception, and without a simple way to restart after the bad record, there is no way to continue downloading the rest of the resource from that site. This is typically not an issue with oai_dc metadata, but I'm seeing it frequently with oai_ead feeds produced by ArchivesSpace.
I have found that back patching the parse method from oaipmh.client to create an XMLParser with the recover=True option enables a more forgiving parser that doesn't crash. (Typical sorts of errors I've seen are unescaped ampersands and unpaired quotes around attribute values.)
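To illustrate the difference, here is a minimal sketch using lxml directly rather than pyoai. The helper names `parse_strict` and `parse_forgiving` and the sample payload are made up for this example, and how the forgiving parser gets wired into oaipmh.client is deliberately left out.

```python
from lxml import etree

# Hypothetical sample payload: the ampersand in "Smith & Sons" is not escaped,
# which is one of the typical errors mentioned above.
BAD_XML = b"<record><title>Smith & Sons papers</title><date>1901</date></record>"


def parse_strict(xml_bytes):
    """Parse with lxml's default parser, which raises on malformed input."""
    return etree.XML(xml_bytes)


def parse_forgiving(xml_bytes):
    """Parse with recover=True so lxml tries to build a tree anyway."""
    parser = etree.XMLParser(recover=True)
    return etree.XML(xml_bytes, parser)


try:
    parse_strict(BAD_XML)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

# The forgiving parser returns a tree instead of raising.
root = parse_forgiving(BAD_XML)
```

Note that recovery is best effort: the tree that comes back may be missing whatever the parser could not make sense of, so it should not be treated as a faithful copy of the record.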
I'm still exploring other modifications: rather than silently working around the errors, I would probably like to log them, and perhaps also save the raw data so it can be run through a validator to report problems upstream.
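A rough sketch of what that logging and saving could look like, again with lxml on its own. The function `parse_and_report`, the logger name, and the `label` and `dump_dir` parameters are all hypothetical names for this example, not part of oai-harvest or pyoai. It relies on the parser's `error_log`, which lxml fills in during the single recovering parse, so a second strict parse is not needed.

```python
import logging
from pathlib import Path

from lxml import etree

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("harvest.parse")


def parse_and_report(xml_bytes, label, dump_dir):
    """Parse forgivingly; log any problems and keep a copy of the raw bytes.

    `label` identifies the request or record and is used as a file name,
    so it is assumed to be filesystem-safe.
    Returns (tree, path_to_raw_copy_or_None).
    """
    parser = etree.XMLParser(recover=True)
    tree = etree.XML(xml_bytes, parser)

    # error_log holds the problems lxml recovered from during this parse.
    errors = list(parser.error_log)
    if not errors:
        return tree, None

    for entry in errors:
        logger.warning(
            "%s: line %d, column %d: %s",
            label, entry.line, entry.column, entry.message,
        )

    # Save the untouched payload so it can be run through a validator later.
    dump_path = Path(dump_dir)
    dump_path.mkdir(parents=True, exist_ok=True)
    raw_path = dump_path / f"{label}.xml"
    raw_path.write_bytes(xml_bytes)
    return tree, raw_path
```

When a payload holds several records, `label` can only identify the request that failed rather than the specific record, which matches the limitation discussed in the comments.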
Just wondering if you have any thoughts about the best way to deal with this issue, both for my needs and what sort of solution you might consider accepting upstream.