Add easy way to iterate over warc records #14

sirex · 2016-09-29T15:36:36Z

I was surprised that the example provided in the documentation:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)

reads everything into memory, and there is no easy way to iterate over records without doing so.

In my case, WARC files take gigabytes of space, so I want to process them record by record without loading everything into memory.

After reading the source, I came up with this helper function:

import warcat.model


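def readwarc(filename, types=('response',)):
    """Lazily yield (record, open_content) pairs from a WARC file."""
    f = warcat.model.WARC.open(filename)
    has_more = True
    while has_more:
        # Read one record at a time instead of loading the whole file.
        record, has_more = warcat.model.WARC.read_record(f)
        # An empty `types` means no filtering by record type.
        if not types or record.warc_type in types:
            # Yield a callable that opens the content, not the content itself.
            if isinstance(record.content_block, warcat.model.BlockWithPayload):
                yield record, record.content_block.payload.get_file
            elif hasattr(record.content_block, 'binary_block'):
                yield record, record.content_block.binary_block.get_file
            else:
                yield record, record.content_block.get_file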


for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        pass  # process f

I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it looking something like this:

import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        pass  # process f
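
A minimal sketch of how such a lazy iterator could be structured, independent of Warcat itself. The name `iter_records` and its parameters are hypothetical; the only assumption taken from the helper above is that the reader callable returns a `(record, has_more)` pair and that records expose a `warc_type` attribute.

```python
from typing import Any, Callable, Iterator, Tuple


def iter_records(
    file_obj: Any,
    read_record: Callable[[Any], Tuple[Any, bool]],
    types: Tuple[str, ...] = ("response",),
) -> Iterator[Any]:
    """Yield records one at a time instead of collecting them in a list.

    `read_record` is any callable that takes the open file and returns
    `(record, has_more)`. `iter_records` is a hypothetical name, not an
    existing Warcat function.
    """
    has_more = True
    while has_more:
        # Only one record is read per loop step, so memory use stays
        # bounded by the size of a single record's metadata.
        record, has_more = read_record(file_obj)
        # An empty `types` tuple means "no filtering".
        if not types or getattr(record, "warc_type", None) in types:
            yield record
```

With Warcat, the helper above suggests this could be called as `iter_records(warcat.model.WARC.open(filename), warcat.model.WARC.read_record)`, though that pairing is untested here.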

Also, if I could get lxml, BeautifulSoup and JSON views of records, something like this:

for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soap.select('a')
    record.json['a']

Then it would be really amazing.
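
One way the parsed views could stay lazy is to compute them on first access and cache the result. The sketch below shows this for JSON only, using the standard library. `RecordView` and `open_content` are hypothetical names, not part of Warcat; `open_content` stands in for a callable such as the `get_file` methods used in the helper above.

```python
import json
from functools import cached_property
from typing import Any, BinaryIO, Callable


class RecordView:
    """Hypothetical wrapper around a record; not part of Warcat."""

    def __init__(self, record: Any, open_content: Callable[[], BinaryIO]):
        self.record = record
        # Assumed to return a fresh binary file object on each call.
        self._open_content = open_content

    def content(self) -> BinaryIO:
        """Open the record's content as a binary file object."""
        return self._open_content()

    @cached_property
    def json(self) -> Any:
        # Parsed only when first accessed, then cached on the instance.
        with self.content() as f:
            return json.load(f)
```

The `lxml` and `soap` attributes could presumably follow the same pattern, with each parser imported only when its property is used, so that neither library becomes a hard dependency.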

If you agree with the suggested API, I can create a pull request with the implementation.

chfoo · 2016-10-05T21:01:25Z

Sure, I think that sounds great!

ikbear · 2020-03-22T11:57:22Z

Is there any update on this suggestion? I encountered the same problem when loading large data:

warcat/util.py", line 66, in find_file_pattern raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted

nmccamish · 2024-09-30T21:07:06Z

I am getting the same error as @ikbear, while running python -m warcat list.

chfoo added the enhancement label Oct 5, 2016
