Add easy way to iterate over warc records #14

Open
sirex opened this issue Sep 29, 2016 · 3 comments

@sirex
sirex commented Sep 29, 2016

I was surprised that the example provided in the documentation:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)

reads everything into memory. There is no easy way to iterate over records lazily, one at a time.

In my case, WARC files take gigabytes of space, so I want to process them record by record without loading everything into memory.

After reading the source, I came up with this helper function:

import warcat.model


def readwarc(filename, types=('response',)):
    """Lazily yield (record, get_file) pairs without loading the whole WARC."""
    f = warcat.model.WARC.open(filename)
    try:
        has_more = True
        while has_more:
            record, has_more = warcat.model.WARC.read_record(f)
            if not types or record.warc_type in types:
                block = record.content_block
                if isinstance(block, warcat.model.BlockWithPayload):
                    yield record, block.payload.get_file
                elif hasattr(block, 'binary_block'):
                    yield record, block.binary_block.get_file
                else:
                    yield record, block.get_file
    finally:
        # Close the file even if the caller stops iterating early.
        f.close()


for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        pass  # process f

I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it looking something like this:

import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        pass  # process f
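
To make the proposal more concrete, here is a minimal sketch of the lazy-iteration pattern behind it. It is not warcat code: `iter_records` and its parameters are hypothetical names, and the open and read functions are passed in so the sketch does not depend on warcat internals. It only assumes what the helper above already relies on, namely that the read function returns a `(record, has_more)` pair and that records expose `warc_type`.

```python
from typing import Any, BinaryIO, Callable, Iterable, Iterator, Optional, Tuple


def iter_records(
    open_file: Callable[[str], BinaryIO],
    read_record: Callable[[BinaryIO], Tuple[Any, bool]],
    filename: str,
    types: Optional[Iterable[str]] = ("response",),
) -> Iterator[Any]:
    """Yield records one at a time instead of building a full list.

    Hypothetical helper: `open_file` and `read_record` are supplied by the
    caller; `read_record` is assumed to return (record, has_more).
    """
    wanted = set(types) if types else None  # None means "yield every type"
    f = open_file(filename)
    try:
        has_more = True
        while has_more:
            record, has_more = read_record(f)
            if wanted is None or record.warc_type in wanted:
                yield record
    finally:
        # Runs when iteration finishes or the generator is closed early.
        f.close()
```

With warcat, the two callables would presumably be `warcat.model.WARC.open` and `warcat.model.WARC.read_record`, as used in the helper above. Only one record is held by the generator at a time, so memory use does not grow with the size of the archive.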

Also, if I could get lxml, BeautifulSoup, and JSON views from records, something like this:

for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soup.select('a')
    record.json['a']

Then it would be really amazing.
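
One way such convenience attributes could work is as lazily computed, cached properties on a thin wrapper. The sketch below is only an illustration under assumptions: `RecordView` is a hypothetical class, not part of warcat, and it takes a callable that returns a readable binary file, in the same way `get_file` is used in the helper above. The `lxml` and `soup` properties import their third-party libraries only when accessed, so the class itself needs nothing beyond the standard library.

```python
import io
import json
from functools import cached_property
from typing import Any, BinaryIO, Callable


class RecordView:
    """Hypothetical wrapper that parses a record body on first access."""

    def __init__(self, open_content: Callable[[], BinaryIO]):
        self._open_content = open_content

    def content(self) -> BinaryIO:
        # Return a fresh file object for the record body.
        return self._open_content()

    @cached_property
    def body(self) -> bytes:
        # Reads one record body into memory, not the whole archive.
        with self.content() as f:
            return f.read()

    @cached_property
    def json(self) -> Any:
        # Assumes the body is UTF-8 encoded JSON.
        return json.loads(self.body.decode("utf-8"))

    @cached_property
    def lxml(self):
        import lxml.html  # third-party, imported only when needed
        return lxml.html.fromstring(self.body)

    @cached_property
    def soup(self):
        from bs4 import BeautifulSoup  # third-party, imported only when needed
        return BeautifulSoup(self.body, "html.parser")


# Demo with an in-memory payload standing in for a record body.
view = RecordView(lambda: io.BytesIO(b'{"a": 1}'))
print(view.json["a"])
```

Each parsed form is computed once per record and then cached, so records that are never inspected cost nothing beyond the read itself.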

If you agree with the suggested API, I can create a pull request with the implementation.

@chfoo
Owner

chfoo commented Oct 5, 2016

Sure, I think that sounds great!

@ikbear
ikbear commented Mar 22, 2020

Is there any update on this suggestion? I encountered the same problem when loading large data:

  warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted

@nmccamish

I am getting the same error as @ikbear, while running python -m warcat list.
