IllegalArgumentException on ARC Parsing
Closed this issue · 1 comment
Many thanks for fixing the issue with the newlines before records in ARC files. Tika can now process these files, with the exceptions below.
Some files still fail with the following error:
NARA-PEOT-2004-20041109221858-00339-crawling004.archive.org.arc.gz
java.lang.IllegalArgumentException: parse error at position 20: text/html;ISO-8859-1<-- HERE -->
at org.netpreserve.jwarc.MediaType.parse(MediaType.java:386)
at java.base/java.util.Optional.map(Optional.java:260)
at org.netpreserve.jwarc.Message.contentType(Message.java:61)
at org.netpreserve.jwarc.WarcResponse$1.type(WarcResponse.java:71)
at java.base/java.util.Optional.map(Optional.java:260)
at org.netpreserve.jwarc.WarcResponse.payloadType(WarcResponse.java:62)
at org.apache.tika.parser.warc.WARCParser.processResponse(WARCParser.java:135)
...lots of other Tika messages
These files were all created by the Internet Archive back in 2004. Attached is the file that produced the above error.
NARA-PEOT-2004-20041109221858-00339-crawling004.archive.org.arc.gz
Released v0.29.0 which adds MediaType.parseLeniently() and uses it in Message.contentType().
In this case the invalid parameter, which is missing its "=", is simply ignored instead of causing an IllegalArgumentException. When using the lenient parser, validity can be checked with mediaType.isValid() and the original string can be accessed with mediaType.raw().
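
To illustrate the behaviour described above, here is a minimal, self-contained sketch of lenient media type parsing. It is not jwarc's implementation: the class LenientMediaTypeSketch and its fields are hypothetical and only mirror the idea behind MediaType.parseLeniently(), isValid() and raw(). It assumes that a parameter without "=" is skipped and marks the result as invalid, and it does not handle quoted parameter values that contain ";".

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of lenient parsing; not part of jwarc. */
final class LenientMediaTypeSketch {
    final String raw;                      // the original, unmodified header value
    final String type;                     // e.g. "text/html", lower-cased
    final Map<String, String> parameters;  // only the well-formed name=value pairs
    final boolean valid;                   // false if anything had to be skipped

    private LenientMediaTypeSketch(String raw, String type,
                                   Map<String, String> parameters, boolean valid) {
        this.raw = raw;
        this.type = type;
        this.parameters = Collections.unmodifiableMap(parameters);
        this.valid = valid;
    }

    /** Parses a Content-Type value, skipping malformed parameters instead of throwing. */
    static LenientMediaTypeSketch parseLeniently(String raw) {
        // Simplification: a plain split does not honour quoted strings containing ';'.
        String[] parts = raw.split(";");
        String type = parts[0].trim().toLowerCase(Locale.ROOT);
        boolean valid = type.matches("[^/\\s]+/[^/\\s]+");

        Map<String, String> parameters = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq <= 0) {
                // Missing "=" or empty name (e.g. "ISO-8859-1"): ignore it, remember the problem.
                valid = false;
                continue;
            }
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = part.substring(eq + 1).trim();
            parameters.put(name, value);
        }
        return new LenientMediaTypeSketch(raw, type, parameters, valid);
    }

    public static void main(String[] args) {
        // The header value from the stack trace in this issue.
        LenientMediaTypeSketch m = parseLeniently("text/html;ISO-8859-1");
        System.out.println(m.type + " valid=" + m.valid + " raw=" + m.raw);
    }
}
```

With the header from this report, the sketch keeps text/html as the type, drops the stray ISO-8859-1 token, and reports the value as not valid while still exposing the raw string, so a caller such as Tika can fall back to content sniffing for the charset.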