iipc/jwarc

UncheckedIOException, invalid WARC record error

gleporeNARA opened this issue · 3 comments

For this file I'm getting an "invalid WARC record" error.

https://eotarchive.s3.amazonaws.com/crawl-data/EOT-2004/segments/NARA-000/warc/NARA-PEOT-2004-20041015060312-00241-crawling009-c_NARA-PEOT-2004-20041015071841-00279-crawling009.archive.org.arc.gz

Here's the error:

Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.warc.WARCParser@5e85c21b
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:312)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:1069)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:493)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:256)
Caused by: java.io.UncheckedIOException: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 71: ...og/lm_requestform.cfm?cFileno=WEL 802.02<-- HERE -->(B)&cDocno=LTSM00012740&cLoc=Internet%20...
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:329)
    at org.apache.tika.parser.warc.WARCParser.parse(WARCParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    ... 5 more
Caused by: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 71: ...og/lm_requestform.cfm?cFileno=WEL 802.02<-- HERE -->(B)&cDocno=LTSM00012740&cLoc=Internet%20...
    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:315)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:327)

Following up on this a bit: I analyzed the 58,901 ARC files that I'm working with, and 10,763 of them throw this exception. My worry is that the exception might be preventing access to records further along in the ARC file, which would be bad for my project. The analysis also revealed tons of "ERROR: invalid HTTP header Content-Length" errors, but I'm not sure what effect those have on processing the data.

These files were created by the Internet Archive back in 2004, so presumably they are correctly formatted (at least according to their interpretation of the spec).

Thanks.

ato commented

The URL in this record is invalid: it contains unencoded spaces, so parts of the URL end up in the wrong field.
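To illustrate why a space in the URL shifts the fields, here is a minimal sketch. It assumes the ARC v1 header layout of five space-separated fields (URL, IP address, archive date, content type, length). The class `ArcHeaderSketch`, its methods, and the sample line (URL, IP, date, and length) are all hypothetical and are not taken from jwarc or from the file above. The tolerant variant is only one possible workaround, and it assumes the URL is the only field that can contain spaces.

```java
import java.util.Arrays;

public class ArcHeaderSketch {
    /** Naive parse: split on every space and read the fields left to right. */
    public static String[] naiveSplit(String line) {
        return line.split(" ");
    }

    /**
     * Tolerant parse (an assumption, not jwarc's behavior): take the four
     * trailing fields from the right and treat everything before them as the URL.
     * Returns {url, ip, date, contentType, length}.
     */
    public static String[] parseFromRight(String line) {
        String[] parts = line.split(" ");
        if (parts.length < 5) {
            throw new IllegalArgumentException("too few fields: " + line);
        }
        int n = parts.length;
        // Rejoin the leading pieces so spaces inside the URL are preserved.
        String url = String.join(" ", Arrays.copyOfRange(parts, 0, n - 4));
        return new String[] { url, parts[n - 4], parts[n - 3], parts[n - 2], parts[n - 1] };
    }

    /** Percent-encode literal spaces so the URL is a single token again. */
    public static String encodeSpaces(String url) {
        return url.replace(" ", "%20");
    }

    public static void main(String[] args) {
        // Hypothetical header line; the URL contains one unencoded space.
        String line = "http://example.org/form.cfm?cFileno=WEL 802.02(B)&cDocno=1"
                + " 192.0.2.1 20041015060312 text/html 1234";

        String[] naive = naiveSplit(line);
        System.out.println("naive field count: " + naive.length);
        System.out.println("naive 'IP' field:  " + naive[1]);

        String[] fields = parseFromRight(line);
        System.out.println("recovered URL:     " + encodeSpaces(fields[0]));
        System.out.println("recovered IP:      " + fields[1]);
    }
}
```

With the sample line, the naive split produces six fields instead of five, and the tail of the URL lands where the IP address is expected. Reading the fixed fields from the right recovers the intended values.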

Right, I see that, thanks! I am pursuing this with the Internet Archive, as they created the file. Oddly enough, the corresponding CDX file they sent has the space correctly encoded. Closing.