iipc/jwarc

ParsingException when reading ClueWeb09 files

Closed this issue · 3 comments

Hi, I would like to use jwarc to parse the files in the ClueWeb09 collection. However, for some of the files (for instance, parts/ClueWeb09_English_1/en0000/09.warc.gz), parsing fails with the following exception:

org.netpreserve.jwarc.ParsingException: invalid WARC record at position 79: ...rget-URI: http://2fered.tistory.com/tag/<-- HERE -->\x08\xffffffc3\xffffff80\xffffffc2\xffffffa4\xffffffc2\xffffffb8\xffffffc2\xffffffac\r\nWARC-Warcinfo-ID: 5fdd2301-6c...
    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:315)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
    ...

There are some known encoding issues with the ClueWeb09 data (e.g. as discussed in this paper), so it seems plausible that this is the underlying issue here. Is there a way for jwarc to deal with such 'dirty' web content? Or could it be caused by something else?

Note: I just saw the closely related issue #26, but that one seems to have a different cause. Here the ParsingException occurs in the middle of the WARC-Target-URI field, not at the CRLF characters at the end of a line.

Looks like there's no defense against invalid UTF-8 in the URL? Not surprised to see that in a WARC; there have been occasions in the past where Common Crawl has written such bad URLs in our WARCs 😅
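
One way to test that guess is to look at the bytes printed after the `<-- HERE -->` marker in the exception message. The following is only a rough sketch using the JDK, not jwarc: the class and method names are made up for illustration, and it assumes the `\xffffffc3`-style escapes are sign-extended single bytes (so `\xffffffc3` is the byte `0xC3`).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for triage; not part of jwarc.
class HeaderByteCheck {
    // Bytes following the "<-- HERE -->" marker in the exception message,
    // with the sign-extended escapes reduced to single bytes (assumption).
    static final byte[] SUSPECT = {
        0x08, (byte) 0xC3, (byte) 0x80, (byte) 0xC2, (byte) 0xA4,
        (byte) 0xC2, (byte) 0xB8, (byte) 0xC2, (byte) 0xAC
    };

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Index of the first byte outside printable ASCII (0x21..0x7E), or -1 if none.
    static int firstNonPrintableAscii(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if (b < 0x21 || b > 0x7E) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("valid UTF-8: " + isValidUtf8(SUSPECT));
        System.out.println("first non-printable-ASCII byte at index: "
            + firstNonPrintableAscii(SUSPECT));
    }
}
```

On the nine bytes visible in the message, the strict decoder accepts the sequence, and the first byte outside printable ASCII is the control character `0x08` at index 0, which is exactly where the parser stopped. So the visible part may be a control character followed by non-ASCII bytes rather than malformed UTF-8, although the message is truncated and says nothing about the bytes further along.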

@ato thanks for the incredibly quick fix in 9771f23! I have tested it on the WARC file mentioned above, and it now successfully parses the whole file.

ato commented

Glad that worked. I've released it as version 0.30.0; it should sync to Maven Central in an hour or so.