iipc/webarchive-commons

WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself

saraaubry opened this issue · 0 comments

In the current implementation of the WAT extractor, the WARC-Filename in the WAT warcinfo record is the filename of the original (W)ARC file.
According to the WARC ISO standard, it should be the WAT filename itself.

Current:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-02-18T10:24:54Z
WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.gz
WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
Content-Type: application/warc-fields
Content-Length: 73

Target:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-02-18T10:24:54Z
WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.wat.gz
WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
Content-Type: application/warc-fields
Content-Length: 73

Implementation:
java extractor.jar -wat fichierA.warc.gz --> output goes to standard output
WARC-Filename is derived from the input filename:
fichierA.warc.gz => fichierA.warc.wat.gz
fichierA.arc.gz => fichierA.arc.wat.gz
fichierA.warc => fichierA.warc.wat
fichierA.arc => fichierA.arc.wat

java extractor.jar -wat fichierA.warc.gz fichierB.wat.warc.gz --> output goes to the file fichierB.wat.warc.gz
WARC-Filename: fichierB.wat.warc.gz
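
The two cases above could be sketched roughly as follows. This is only an illustration of the proposed naming rule, not code from webarchive-commons: the class `WatFilename` and the method `watFilename` are hypothetical names, and I am assuming the extractor knows the input path and, optionally, the output path at the time it writes the warcinfo record.

```java
import java.nio.file.Paths;

// Hypothetical helper illustrating the proposed rule; not part of webarchive-commons.
class WatFilename {

    /**
     * Returns the value to use for WARC-Filename in the WAT warcinfo record.
     *
     * @param inputPath  path of the source (W)ARC file
     * @param outputPath path of the WAT output file, or null when writing to standard output
     */
    static String watFilename(String inputPath, String outputPath) {
        // Case 2: an explicit output file was given, so use its name as-is.
        if (outputPath != null) {
            return Paths.get(outputPath).getFileName().toString();
        }
        // Case 1: output goes to standard output, so derive the name from the input.
        String name = Paths.get(inputPath).getFileName().toString();
        if (name.endsWith(".gz")) {
            // fichierA.warc.gz => fichierA.warc.wat.gz
            String base = name.substring(0, name.length() - ".gz".length());
            return base + ".wat.gz";
        }
        // fichierA.warc => fichierA.warc.wat
        return name + ".wat";
    }

    public static void main(String[] args) {
        System.out.println(watFilename("fichierA.warc.gz", null));
        System.out.println(watFilename("fichierA.arc", null));
        System.out.println(watFilename("fichierA.warc.gz", "fichierB.wat.warc.gz"));
    }
}
```

Only the file name (without any directory part) is kept, on the assumption that WARC-Filename should not contain a path.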