Incremental harvesting? #22
Hi,

TL;DR - internally it does harvest incrementally in batches (or pages), but unfortunately I don't think there's a simple way to control batch sizes or starting points.

I don't think there is a simple way to support this, as it is not actually supported by the OAI-PMH protocol itself. Selective harvesting in OAI-PMH only supports retrieving records based on when they were added/updated/deleted, or by requesting a specific set, or a combination of these. Batching, or paging, in a ListRecords request in OAI-PMH is controlled by the server: the server decides the number of records it will deliver in each page, and provides a resumptionToken with which the client can request the next page. Also, my interpretation of the OAI-PMH standard section on flow control is that there's no requirement that the server returns records in a consistent order, except through the use of a resumptionToken.
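To make the flow-control mechanics concrete, here is a minimal, self-contained sketch (not part of oai-harvest; the base URL and set name are placeholders) of how a client pages through ListRecords responses by echoing the server's resumptionToken back until it is empty:

```python
# Sketch only: follow OAI-PMH resumptionTokens by hand.
# The server alone decides how many records each page contains.
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records(base_url, metadata_prefix="oai_dc", set_spec=None):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(requests.get(base_url, params=params).content)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find("./{0}ListRecords/{0}resumptionToken".format(OAI_NS))
        if token is None or not (token.text or "").strip():
            return  # no resumptionToken: the server has no more pages
        # On resumption requests only the verb and the token may be sent
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

This is essentially what pyoai (used by oai-harvest) does internally, which is why the client cannot choose the page size or the starting point.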
Thank you for your quick response. As a workaround, I tried simply suspending the oai-harvest process by sending it a stop signal.

Maybe I can modify oai-harvest so that, once some minimum number of records has been fetched, it postpones fetching the next page until some interval has passed (but only after all records that the server wants to include in the current page have been processed). The interval and the minimum number of records could be passed on the command line. For my use case this would be good enough (I don't need a strict limit on the number of records, I just need to ensure that harvesting only happens during the night). Does this sound workable?
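As a rough illustration of what those two command-line parameters might look like, here is a minimal argparse sketch; the option names --min-records and --sleep-interval are purely hypothetical, not existing oai-harvest options:

```python
import argparse

# Illustrative only: these options do not exist in oai-harvest today.
parser = argparse.ArgumentParser(description="throttling options (sketch)")
parser.add_argument("--min-records", type=int, default=0, metavar="N",
                    help="fetch at least N records before pausing")
parser.add_argument("--sleep-interval", type=float, default=0.0,
                    metavar="SECONDS",
                    help="seconds to pause once N records have been fetched")

args = parser.parse_args(["--min-records", "1000", "--sleep-interval", "3600"])
print(args.min_records, args.sleep_interval)
```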
I looked at the code and I think I found a way to add this functionality. I'd add some optional arguments to the argparser in oaiharvest/harvest.py (lines 303 to 306 at 0ab56f6),
which in turn get passed through to the harvesting code in oaiharvest/harvest.py (lines 91 to 97 at 0ab56f6).
When the limit is passed, I reset the counter if applicable and then simply let the process sleep for the set amount of time. This will likely cause the process to sleep in the middle of a page, but looking through the pyoai code, it appears this shouldn't be a problem, because each resumption request is fully handled before the records it contains are yielded.
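A minimal sketch of the counter-and-sleep idea described above, assuming the records arrive through an iterator; the function and parameter names are hypothetical, not actual oai-harvest code:

```python
import time


def throttle_records(records, min_records, sleep_interval):
    """Wrap a record iterator, pausing after every min_records records.

    Note: page boundaries are not visible at this level, so this sketch
    simply counts records rather than waiting for the end of a page.
    """
    count = 0
    for record in records:
        yield record
        count += 1
        if min_records and count >= min_records:
            count = 0  # reset the counter, as described above
            time.sleep(sleep_interval)
```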
So if I understand, you're trying to harvest a set so big that it takes more than about 8 hours, but you need to harvest only at night? To be honest, it might be easiest to write a subclass of DirectoryOAIHarvester, something like this:

```python
import codecs
import logging
import os
import time
from datetime import datetime

from oaiharvest.harvest import DirectoryOAIHarvester
# Assuming the registry/reader classes used below live in oaiharvest.metadata
from oaiharvest.metadata import DefaultingMetadataRegistry, XMLMetadataReader


class NightTimeDirectoryOAIHarvester(DirectoryOAIHarvester):

    def harvest(self, baseUrl, metadataPrefix, **kwargs):
        logger = logging.getLogger(__name__).getChild(self.__class__.__name__)
        for header, metadata, about in self._listRecords(
                baseUrl,
                metadataPrefix=metadataPrefix,
                **kwargs):
            # This is quick and dirty; you could calculate the exact time
            # to sleep for instead
            hour = datetime.now().hour
            while 6 <= hour < 22:
                logger.debug("not night-time, sleeping for 10 mins...")
                time.sleep(600)  # Sleep for 10 minutes
                hour = datetime.now().hour
            fp = self._get_output_filepath(header, metadataPrefix)
            self._ensure_dir_exists(fp)
            if not header.isDeleted():
                logger.debug('Writing to file {0}'.format(fp))
                with codecs.open(fp, "w", encoding="utf-8") as fh:
                    fh.write(metadata)
            else:
                if self.respectDeletions:
                    logger.debug("Respecting server request to delete file "
                                 "{0}".format(fp))
                    try:
                        os.remove(fp)
                    except OSError:
                        # File probably doesn't exist in destination directory
                        # No further action needed
                        pass
                else:
                    logger.debug("Ignoring server request to delete file "
                                 "{0}".format(fp))
        else:
            # Harvesting completed, all available records stored
            return True
        # Loop must have been stopped with ``break``, e.g. due to
        # arbitrary limit
        return False


# Set up metadata registry
xmlReader = XMLMetadataReader()
metadata_registry = DefaultingMetadataRegistry(defaultReader=xmlReader)

harvester = NightTimeDirectoryOAIHarvester(metadata_registry,
                                           "/path/to/store/files")
harvester.harvest(
    "http://oaipmh.example.com/oaipmh/2.0/",
    metadataPrefix="oai_dc",
    set="SET")
```

If you do want to change the command line arguments, updating the whole CLI to use Click would be a nice improvement.
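For reference, a minimal sketch of what a Click-based entry point could look like; the command, option names, and behaviour here are illustrative only, not the existing oai-harvest CLI:

```python
import click


@click.command()
@click.option("-s", "--set", "set_spec", default=None,
              help="OAI-PMH set to harvest")
@click.option("-l", "--limit", type=int, default=None,
              help="maximum number of records to harvest")
@click.argument("provider")
def cli(provider, set_spec, limit):
    """Harvest records from PROVIDER (sketch only)."""
    click.echo("would harvest set={0} limit={1} from {2}".format(
        set_spec, limit, provider))


if __name__ == "__main__":
    cli()
```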
Yes. Though I should add that I'm not only harvesting the metadata, but also additional data linked from within the records. I'll do the latter with a separate script.
That looks like more code than I was planning to write. Also, I'm being paid from public money, so I'd prefer to make it reusable for the rest of the world.
Maybe, but I'm not being paid to do that. You are already using argparse.

I'm now going to work on what I last proposed. I'll test it tonight and probably send you a pull request tomorrow or Thursday.
Dear @bloomonkey, I'd like to harvest a big set incrementally. My impression is that oai-harvest does not support this scenario. If I run
oai-harvest -s SET -l 3 provider
twice, it will simply download the same records twice. Is there a simple way in which I could modify oai-harvest in order to support this? I'd be happy to submit a pull request. Alternatively, are you aware of a tool that already supports this functionality? Thanks in advance.