ParsingException when reading ClueWeb09 files
Hi, I would like to use jwarc to parse the files in the ClueWeb09 collection. However, for some of the files (for instance, parts/ClueWeb09_English_1/en0000/09.warc.gz), parsing fails with the following exception:
org.netpreserve.jwarc.ParsingException: invalid WARC record at position 79: ...rget-URI: http://2fered.tistory.com/tag/<-- HERE -->\x08\xffffffc3\xffffff80\xffffffc2\xffffffa4\xffffffc2\xffffffb8\xffffffc2\xffffffac\r\nWARC-Warcinfo-ID: 5fdd2301-6c...
at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:315)
at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
...
There are some known encoding issues with the ClueWeb09 data (e.g. as discussed in this paper), so that may well be the underlying issue here. Is there a way for jwarc to deal with such 'dirty' web content? Or could it be caused by something else?
Note: I just saw the closely related issue #26, but that one seems to have a different cause. Here the ParsingException happens in the middle of the WARC-Target-URI field, not at the CRLF characters at the end of a line.
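To make the failing position easier to reason about, the bytes from the exception message can be checked by hand. The following is only a minimal sketch using the JDK: the class name `HeaderByteCheck` and the method `firstNonVisibleAscii` are hypothetical, it is not jwarc's actual validation logic, and it assumes that forms like `\xffffffc3` in the message are sign-extended renderings of single bytes such as 0xC3.

```java
import java.nio.charset.StandardCharsets;

public class HeaderByteCheck {
    // Bytes that follow "http://2fered.tistory.com/tag/" in the exception message
    // (assumption: each \xffffffNN is the single byte 0xNN, printed sign-extended).
    static final byte[] TAIL = {
        0x08, (byte) 0xC3, (byte) 0x80, (byte) 0xC2, (byte) 0xA4,
        (byte) 0xC2, (byte) 0xB8, (byte) 0xC2, (byte) 0xAC
    };

    /** Returns the index of the first byte outside visible ASCII (0x21..0x7E), or -1 if none. */
    static int firstNonVisibleAscii(byte[] value) {
        for (int i = 0; i < value.length; i++) {
            int b = value[i] & 0xFF; // mask so that bytes >= 0x80 are not treated as negative
            if (b < 0x21 || b > 0x7E) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] prefix = "http://2fered.tistory.com/tag/".getBytes(StandardCharsets.US_ASCII);
        byte[] uri = new byte[prefix.length + TAIL.length];
        System.arraycopy(prefix, 0, uri, 0, prefix.length);
        System.arraycopy(TAIL, 0, uri, prefix.length, TAIL.length);

        int idx = firstNonVisibleAscii(uri);
        if (idx >= 0) {
            System.out.printf("first suspicious byte at offset %d: 0x%02x%n", idx, uri[idx] & 0xFF);
        } else {
            System.out.println("value is plain visible ASCII");
        }
    }
}
```

Under these assumptions, the first byte flagged is the one immediately after `/tag/`, which matches where the `<-- HERE -->` marker sits in the message.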
Looks like there's no defense against invalid UTF-8 in the URL? I'm not surprised to see that in a WARC; there have been occasions in the past where Common Crawl wrote such bad URLs in our WARCs 😅
Glad that worked. I've released it as version 0.30.0; it should sync to Maven Central in an hour or so.