harvest exits on parse errors #25

Open
sdm7g opened this issue Jan 28, 2019 · 4 comments

sdm7g commented Jan 28, 2019

If the harvester hits a parse error in the metadata payload, it quits with an exception. Since there is no simple way to restart after the bad record, there is no way to continue downloading the rest of the resources from that site. This is typically not an issue with oai_dc metadata, but I'm seeing it frequently with oai_ead feeds produced by ArchivesSpace.
I have found that back patching the parse method from oaipmh.client to create an XMLParser with the recover=True option gives a more forgiving parser that doesn't crash. (Typical errors I've seen are unescaped ampersands and unpaired quotes around attribute values.)
I'm still exploring other modifications: rather than silently working around the errors, I would probably like to log them, and perhaps also save the raw data so it can be run through a validator to report problems upstream.
Just wondering if you have any thoughts about the best way to deal with this issue, both for my needs and what sort of solution you might consider accepting upstream.
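
For reference, a minimal sketch of the forgiving parse described above, using lxml directly rather than the patched pyoai method. The sample payload, the logger name and the function name `parse_forgiving` are made up for illustration; the only behaviour taken from this thread is that `XMLParser(recover=True)` keeps going where the default parser raises.

```python
import logging

from lxml import etree

logger = logging.getLogger("harvest.parse")  # hypothetical logger name

# Hypothetical payload with an unescaped ampersand, one of the errors
# mentioned above.
BAD_XML = b"<record><title>Smith & Jones papers</title></record>"


def parse_forgiving(xml):
    """Parse xml in recover mode and log each problem the parser reported."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(xml, parser)
    # The parser keeps a log of the errors from its last run, so they can
    # be reported instead of being silently worked around.
    for entry in parser.error_log:
        logger.warning("line %d, column %d: %s",
                       entry.line, entry.column, entry.message)
    return root, parser.error_log
```

Calling `etree.fromstring(BAD_XML)` without the recovering parser raises `etree.XMLSyntaxError`, which is the failure that stops the harvest.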

Owner

bloomonkey commented Feb 5, 2019

Hi

Can you maybe provide a sanitised traceback of one of the errors you're seeing?

My first instinct is that the parse error is likely to be occurring somewhere in pyoai code, which oai-harvest uses for all the OAI-PMH protocol interactions. If so, it might be quite difficult to intercept, log, and output the raw data...

Author

sdm7g commented Feb 5, 2019

Yes, the problem is in pyoai.Client.parse code. I'm back patching that method via:
setattr(client.Client, 'parse', parse_patch )
and in order to both complete the harvest and get full trace info to report upstream, I'm trying etree.XML(xml) first, and on etree.XMLSyntaxError, I'm trying etree.XML(xml, etree.XMLParser(recover=True)) and also writing out the raw feed for diagnosis of the problem:

    try:  
        return etree.XML( xml )
    except etree.XMLSyntaxError as perror:
        logger.error(perror)
        print( perror.error_log )
        try:
            tree = etree.XML(xml, etree.XMLParser(recover=True)) # sdm7g: attempt to RECOVER
            id = tree.xpath( '/oai:OAI-PMH/oai:ListRecords/oai:record/oai:header/oai:identifier', namespaces=self.getNamespaces() )
            logger.warning( "Recoverable parse error on: {0}.oai_ead.xml".format(id[0].text) )
        except Exception as perror2:
            logger.error(perror2)
        finally:
            raw_file = id[0].text + ".oai_raw.xml"
            logger.debug( "Writing raw feed as: " + raw_file )
            with open( raw_file, 'wb') as raw:
                raw.write( xml )            
        return tree

Which gives me, on error, for example:

DEBUG    Writing to file /usr/local/projects/Archivespace/OAI/oai:jmu//repositories/4/resources/212.oai_ead.xml
ERROR    xmlParseEntityRef: no name, line 63, column 259 (<string>, line 63)
WARNING  Recoverable parse error on: oai:jmu//repositories/4/resources/213.oai_ead.xml
DEBUG    Writing raw feed as: oai:jmu//repositories/4/resources/213.oai_raw.xml
DEBUG    Writing to file /usr/local/projects/Archivespace/OAI/oai:jmu//repositories/4/resources/213.oai_ead.xml
DEBUG    Writing to file /usr/local/projects/Archivespace/OAI/oai:jmu//repositories/4/resources/214.oai_ead.xml

A missing namespace prefix is another frequent XMLSyntaxError.

I'm still testing, but that seems to work for me (except that, if I use the '--dir' param, it seems to break on writing that raw file; without that param it's OK).

The id[0] may be an unwarranted assumption that there is only one record in the metadata payload, but that appears to be true for these feeds when requesting oai_ead metadata.
Not sure how to handle it if that is not the case, but I should probably at least check that it isn't the case. And I just realized that if the 2nd exception occurs on etree.XML, id will not have been assigned.
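
One way the two gaps just mentioned might be closed is sketched below as a standalone function, not as the actual patch. It assumes the payload arrives as bytes (as the 'wb' write above implies), uses the standard OAI-PMH 2.0 namespace URI in place of self.getNamespaces(), and the name `parse_or_recover`, the `raw_dir` parameter and the flattened file name are all hypothetical.

```python
import logging
import re
from pathlib import Path

from lxml import etree

logger = logging.getLogger("harvest.parse")  # hypothetical logger name

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_or_recover(xml, raw_dir="."):
    """Strict parse first; on a syntax error, recover and keep the raw bytes."""
    try:
        return etree.XML(xml)
    except etree.XMLSyntaxError as error:
        logger.error(error)

    # Bind tree even when recovery itself fails, so later lines never hit
    # an unassigned name.
    try:
        tree = etree.XML(xml, etree.XMLParser(recover=True))
    except etree.XMLSyntaxError as error:
        logger.error(error)
        tree = None

    # Collect every identifier instead of assuming a single record.
    identifiers = []
    if tree is not None:
        nodes = tree.xpath("//oai:record/oai:header/oai:identifier",
                           namespaces=OAI_NS)
        identifiers = [node.text for node in nodes if node.text]
    logger.warning("Recoverable parse error; identifiers in payload: %s",
                   identifiers or "unknown")

    # Hypothetical naming scheme: flatten ':' and '/' so the identifier
    # cannot be read as a nested path.
    stem = "unidentified"
    if identifiers:
        stem = re.sub(r"[^A-Za-z0-9._-]+", "_", identifiers[0])
    raw_file = Path(raw_dir) / (stem + ".oai_raw.xml")
    raw_file.write_bytes(xml)
    logger.debug("Wrote raw feed to %s", raw_file)
    return tree
```

It could be attached with the same setattr call shown earlier, after adding a self parameter. Whether flattening the file name has any bearing on the '--dir' failure is untested.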

But I don't expect many users will require this sort of workaround.
So far, I don't see a cleaner way to handle and generalize it.

Or maybe the sensible way to generalize it is just to have two separate modes to run:
Recover mode, to normally harvest feeds, and Strict mode, for validating and checking feeds.
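
A rough sketch of what such a two-mode switch could look like at the lxml level. The mode names and the functions `make_parser` and `parse_payload` are hypothetical; neither oai-harvest nor pyoai is claimed to offer them.

```python
from lxml import etree


def make_parser(mode="recover"):
    """Return an lxml parser for the named mode (hypothetical mode names)."""
    if mode not in ("recover", "strict"):
        raise ValueError("mode must be 'recover' or 'strict'")
    # Recover mode tolerates malformed input; strict mode raises on it.
    return etree.XMLParser(recover=(mode == "recover"))


def parse_payload(xml, mode="recover"):
    """Parse one payload with the parser for the requested mode."""
    return etree.fromstring(xml, make_parser(mode))
```

With this split, a normal harvest would call parse_payload(xml) and a validation run would call parse_payload(xml, mode="strict") and report the resulting etree.XMLSyntaxError.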

@bloomonkey
Owner

I can sympathise - I spent over 10 years writing code to deal with what felt like an infinite variety of syntax errors in EAD files...

Author

sdm7g commented Apr 12, 2019

I've submitted a pull request upstream to pyoai that adds a recover=True option to the lxml.etree.XMLParser used in Client.
(The issue was simpler once I understood lxml internals better.)

If that gets accepted, then I would submit a patch to fix the issue in harvest by adding that recover=True option to Client creation in harvest.py, and adding some code to log errors.

As I note in that pyoai pull, trying to parse a 2nd time on error is not needed.
( And I decided I didn't want to save the raw files at that point. It wasn't really needed if error logging was working correctly, but was useful for debugging. )

What is lost in making the code more general and correct is the assumption that I can know the particular identifier that fails. In the case of EAD from the feeds I've been testing, there is only one EAD record in each OAI ListRecords payload, but there are multiple oai_dc records, so I have to settle for just knowing which request failed, and not necessarily, in the general case, which id was malformed.
