chfoo/warcat

Add easy way to iterate over warc records

sirex opened this issue · 2 comments

sirex commented

I was surprised that the example provided in the documentation:

>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)

reads everything into memory, and there is no easy way to iterate over records one at a time instead.

In my case, WARC files take up gigabytes of space, so I want to process them record by record without loading everything into memory.

After reading the source code, I came up with this helper function:

import warcat.model


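def readwarc(filename, types=('response',)):
    """Lazily yield (record, open_content) pairs, one record at a time."""
    f = warcat.model.WARC.open(filename)
    has_more = True
    while has_more:
        # read_record returns the next record and whether more records follow
        record, has_more = warcat.model.WARC.read_record(f)
        # An empty `types` disables filtering by record type
        if not types or record.warc_type in types:
            # Yield the callable that opens the content, not the content itself
            if isinstance(record.content_block, warcat.model.BlockWithPayload):
                yield record, record.content_block.payload.get_file
            elif hasattr(record.content_block, 'binary_block'):
                yield record, record.content_block.binary_block.get_file
            else:
                yield record, record.content_block.get_file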


for record, content in readwarc('pages.warc.gz'):
    with content() as f:
        pass  # process f

I think it would be really useful if Warcat provided an interface for lazy iteration over a whole WARC file. I would imagine it looking something like this:

import warcat

for record in warcat.readrecords('pages.warc.gz'):
    with record.content() as f:
        pass  # process f
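
A minimal sketch of how such a generator could be structured, assuming a reader function that returns a `(record, has_more)` pair the way `warcat.model.WARC.read_record` does in my helper above. The names `iter_records` and `read_line_record` are hypothetical and not part of Warcat, and the line-based reader below only stands in for the real one so the example runs on its own:

```python
import io


def iter_records(file_obj, read_record):
    """Yield records one at a time instead of collecting them in a list.

    `read_record` is any callable taking the open file and returning
    (record, has_more). With Warcat this would presumably be
    warcat.model.WARC.read_record, with file_obj from WARC.open(filename).
    """
    try:
        has_more = True
        while has_more:
            record, has_more = read_record(file_obj)
            yield record
    finally:
        # Assumption: each record is fully processed during iteration,
        # so the file can be closed once the generator finishes.
        file_obj.close()


def read_line_record(f):
    """Stand-in reader: treats each line of an in-memory file as a record."""
    line = f.readline().rstrip(b'\n')
    has_more = f.tell() < len(f.getvalue())
    return line, has_more


source = io.BytesIO(b'one\ntwo\nthree\n')
for record in iter_records(source, read_line_record):
    print(record)
```

Only one record is held at a time, so memory use should not grow with the size of the file.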

Also, it would help if I could get lxml, BeautifulSoup and JSON objects directly from records, something like this:

for record in warcat.readrecords('pages.warc.gz'):
    record.lxml.xpath('//a')
    record.soap.select('a')
    record.json['a']

Then it would be really amazing.
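
A rough sketch of how those accessors might be done lazily, so nothing is parsed until it is asked for. The class name `ParsedRecord` and its attributes are hypothetical, not existing Warcat API; it assumes the record body is reachable through a zero-argument callable returning a binary file, like the `get_file` callables in my helper. lxml and BeautifulSoup are imported only when their property is used:

```python
import io
import json
from functools import cached_property


class ParsedRecord:
    """Hypothetical wrapper exposing lazily parsed views of a record body."""

    def __init__(self, open_body):
        # open_body: zero-argument callable returning a binary file object
        self._open_body = open_body

    @cached_property
    def raw(self):
        # Read the body once; later accesses reuse the cached bytes
        with self._open_body() as f:
            return f.read()

    @cached_property
    def json(self):
        return json.loads(self.raw)

    @cached_property
    def lxml(self):
        import lxml.html  # third-party, needed only for this property
        return lxml.html.fromstring(self.raw)

    @cached_property
    def soup(self):
        from bs4 import BeautifulSoup  # third-party, needed only here
        return BeautifulSoup(self.raw, 'html.parser')


# An in-memory body stands in for a real WARC payload
record = ParsedRecord(lambda: io.BytesIO(b'{"a": 1}'))
print(record.json['a'])
```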

If you agree with the suggested API, I can create a pull request with the implementation.

chfoo commented

Sure, I think that sounds great!

Is there any update on this suggestion? I ran into the same problem when loading large data:

warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted