ARC variants with different interpretations of version-block length
tballison opened this issue · 7 comments
Judging from the unit tests, jwarc should be able to read ARC files. When I try to read the ARC test files from warcio, I get an exception.
Is this user error in how I'm calling jwarc or are ARC files not supported?
Test files:
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc
https://github.com/webrecorder/warcio/blob/master/test/data/example.arc.gz
My code:
try (InputStream is = Files.newInputStream(Paths.get("/.../example.arc"))) {
    WarcReader reader = new WarcReader(is);
    for (WarcRecord record : reader) {
        System.out.println(record.type());
    }
}
"warcinfo" is printed once on the console, then there's an exception:
Exception (is the same for both files):
java.io.UncheckedIOException: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->filedesc://live-web-example.arc.gz 127.0...
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:330)
    at org.netpreserve.jwarc.apitests.ArcTest.testMine(ArcTest.java:108)
    .....
Caused by: org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: <-- HERE -->filedesc://live-web-example.arc.gz 127.0...
    at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:309)
    at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:159)
    at org.netpreserve.jwarc.WarcReader$1.hasNext(WarcReader.java:328)
    ... 28 more
There seems to be some variation in how the length field in the version block is calculated between different ARC files. jwarc's ARC support was tested against files generated by Heritrix and some other tools from the Internet Archive.
The example.arc file you linked to has a length value of "75" (0x4b) in the version block. This would exclude the two newlines at the end of it:
Whereas an ARC file in our collection sourced from the Internet Archive includes just the first newline as part of the length "76" (0x4c):
The ARC file format reference itself seems to introduce two more possible variations! It defines the length for the version block as:
The length specifies the size, in bytes, of the rest of the version block.
and the grammar for version-block defines it as ending with two newlines:
version-block == filedesc://<path><sp><version specific data><sp><length><nl>
<version-number><sp><reserved><sp><origin-code><nl>
<URL-record-definition><nl>
<nl>
But reading carefully we see that doc is defined as starting with a single <nl>:
arc_file == <version_block><rest_of_arc_file>
rest_of_arc_file == <doc>|<doc><rest_of_arc_file>
doc == <nl><URL-record><nl><network_doc>
So a strict reading of the grammar implies there should in fact be three newlines between the text "Archive-length" and the URL of the first doc, and the first two of them should count towards length as they're part of the version block.
If we look at the example in that same document, though, it uses a length of "76" (0x4c), has only two newlines, and counts both of them:
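To make the length variants above concrete, here is a minimal Java sketch that works out how many of the trailing newlines a declared length covers. It is not jwarc code: the class and method names are hypothetical, and the origin code "InternetArchive" is only an illustrative placeholder chosen so that the text after the first line comes to 75 bytes.

```java
import java.nio.charset.StandardCharsets;

public class VersionBlockLength {
    /**
     * Returns how many trailing newlines the declared length covers.
     * versionBlockRest is everything after the first line of the version
     * block (hypothetical helper, not part of jwarc).
     */
    static int countedTrailingNewlines(int declaredLength, String versionBlockRest) {
        // Drop the trailing newlines so only the text of lines 2 and 3 remains.
        int end = versionBlockRest.length();
        while (end > 0 && versionBlockRest.charAt(end - 1) == '\n') {
            end--;
        }
        int textBytes = versionBlockRest.substring(0, end)
                .getBytes(StandardCharsets.ISO_8859_1).length;
        // Whatever the declared length has left over must be newlines.
        return declaredLength - textBytes;
    }

    public static void main(String[] args) {
        // Illustrative v1 version block body; the origin code is a placeholder.
        String rest = "1 0 InternetArchive\n"
                + "URL IP-address Archive-date Content-type Archive-length\n\n";
        for (int declared : new int[] {75, 76, 77}) {
            System.out.println(declared + " -> " + countedTrailingNewlines(declared, rest));
        }
    }
}
```

With this placeholder body, a declared length of 75 covers no trailing newline, 76 covers one, and 77 covers both, which corresponds to the "exclude both", "include the first", and "include both" readings discussed above.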
Have you seen this error with ARC files in the wild that contain real data, or just with the example files from the warcio unit tests? I'm also curious what such files look like when they contain more than one document: do they also have extra linefeeds between documents, or is it just the version-block length that differs?
For reference there's an example Heritrix ARC file here which jwarc can successfully read: https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz
Greetings - tballison has sourced many of his test files from my agency, the National Archives and Records Administration. The ARC files were actually created by the Internet Archive with Heritrix back in 2004.
I can confirm that the files we received from the Internet Archive appear to have three newlines before the first record (0A 0A 0A). And the record length takes us from the end of the header to the beginning of the first record:
As for the remaining records, a single newline is the most common separator:
Preceding bytes: b'\n\n' - Occurrences: 228
Preceding bytes: b';\n' - Occurrences: 199
Preceding bytes: b'\xd9\n' - Occurrences: 130
Preceding bytes: b'>\n' - Occurrences: 96
Preceding bytes: b'\x00\n' - Occurrences: 15
Preceding bytes: b' \n' - Occurrences: 10
Preceding bytes: b'\r\n' - Occurrences: 7
Preceding bytes: b'\x82\n' - Occurrences: 5
Preceding bytes: b'}\n' - Occurrences: 4
Preceding bytes: b'\xa9\n' - Occurrences: 2
Preceding bytes: b'\xb0\n' - Occurrences: 1
Preceding bytes: b'F\n' - Occurrences: 1
Preceding bytes: b'l/' - Occurrences: 1
Preceding bytes: b'\x83\n' - Occurrences: 1
Preceding bytes: b'\x7f\n' - Occurrences: 1
Preceding bytes: b'd\n' - Occurrences: 1
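For anyone wanting to produce a similar tally for their own files, a rough Java sketch follows. It assumes the record start offsets are already known (for example from an index) and simply counts the two bytes found before each one. The class name, method name, and sample data are all hypothetical, and this is not necessarily how the numbers above were generated.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class PrecedingBytesTally {
    /** Counts the two bytes found immediately before each record offset. */
    static Map<String, Integer> tally(byte[] arc, int[] recordOffsets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int offset : recordOffsets) {
            if (offset < 2 || offset > arc.length) {
                continue; // not enough context before this offset
            }
            // Key is the two preceding bytes rendered as hex, e.g. "0a 0a".
            String key = String.format("%02x %02x",
                    arc[offset - 2] & 0xff, arc[offset - 1] & 0xff);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up stand-in for an ARC file; records start at offsets 9, 20 and 31.
        byte[] arc = "header\n\n\nrec1 body;\nrec2 body\n\nrec3"
                .getBytes(StandardCharsets.ISO_8859_1);
        tally(arc, new int[] {9, 20, 31})
                .forEach((k, v) -> System.out.println("Preceding bytes: " + k + " - Occurrences: " + v));
    }
}
```

On the made-up sample this reports "0a 0a" twice and "3b 0a" once. A preceding pair of "0a 0a" means a blank line before the record, while any other byte followed by "0a" means a single newline after the previous body.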
What @gleporeNARA said. LOL. Thank you so much @ato for looking into this! Let me know if I can help in any way.
Fix released as v0.28.6. It should sync to Maven Central in an hour or so.
I've updated jwarc to accept 0 to 3 newlines between the end of the previous record's body and the URL of the next record. This should make it compatible with all the variants discussed above and it seems to work with the warcio example.arc:
$ jwarc cdx example.arc
CDX N b a m s k r M S V g
com,example)/ 20140216050221 http://example.com/ text/html 200 - - - 1658 150 example.arc
$ jwarc extract --payload example.arc 150
<!doctype html>
<html>
<head>
<title>Example Domain</title>
...
I've also made it understand the "v2" version-block headers and fixed the parsing exception message so the "<-- HERE -->" should show the right context now.
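As an illustration of the "0 to 3 newlines" rule described above, here is a minimal Java sketch of a lenient separator skip. This is not jwarc's actual parser code, and the class name, method name, and constant are hypothetical. It assumes the previous record's body has already been consumed using its declared length, so any newlines inside the body are not mistaken for separators.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class LenientArcSeparator {
    // Upper bound taken from the discussion above: at most three newlines.
    static final int MAX_NEWLINES = 3;

    /** Consumes up to MAX_NEWLINES '\n' bytes and returns how many were skipped. */
    static int skipSeparator(PushbackInputStream in) throws IOException {
        int skipped = 0;
        while (skipped < MAX_NEWLINES) {
            int b = in.read();
            if (b == '\n') {
                skipped++;
            } else {
                if (b != -1) {
                    in.unread(b); // first byte of the next URL record: put it back
                }
                break;
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        for (String sep : new String[] {"", "\n", "\n\n", "\n\n\n"}) {
            byte[] data = (sep + "http://example.com/").getBytes(StandardCharsets.US_ASCII);
            PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
            int skipped = skipSeparator(in);
            System.out.println(skipped + " newline(s) skipped, next byte: " + (char) in.read());
        }
    }
}
```

In each of the four cases the next byte read is the "h" of the URL, so the same record parser can follow regardless of which separator variant the file uses.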
Wow. Thank you. I'll upgrade in Tika and see what I find on my local set of files.