A simple Node.js server for browsing the contents of a WARC file
- node.js
- npm
- csv packages: npm install csv csv-stream stream-transform
- stdio package: npm install stdio
- Internet Archive WARC tools: pip install warctools
Sample warc used in testing: drupalib.interoperating.info.warc.gz
- Copy drupalib.interoperating.info.warc.gz to the directory ../warc (relative to the directory where warcnode.js is installed), or to any other location
- gunzip drupalib.interoperating.info.warc.gz
- generate the csv index (in the same directory as the unzipped drupalib.interoperating.info.warc):
warcindex drupalib.interoperating.info.warc > drupalib.interoperating.info.warc.csv
- in the directory with warcnode.js:
node warcnode.js --warc ../warc/drupalib.interoperating.info.warc
(or substitute the path to your warc)
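
To show what the index generated above is for, here is a minimal sketch of loading it into a URL-to-offset lookup table. It is not taken from warcnode.js. It assumes each index line is whitespace-separated with the filename, byte offset, record type and URL in the first four columns, and that the header line starts with '#'; check your own .csv file before relying on this. The function name parseIndex and the sample lines are hypothetical.

```javascript
// Sketch only: builds a Map from URL to the location of its WARC record.
// Assumed column order (verify against your own index file):
//   filename offset warc-type uri record-id content-type content-length
function parseIndex(text) {
  const byUrl = new Map();
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    // Skip blank lines and the '#...' header line.
    if (line === '' || line.startsWith('#')) continue;
    // Only the first four columns are needed; later columns are ignored.
    const [filename, offset, type, url] = line.split(/\s+/);
    // Keep only 'response' records, since those hold the archived pages.
    if (type !== 'response') continue;
    byUrl.set(url, { filename, offset: Number(offset) });
  }
  return byUrl;
}

// Hypothetical sample data in the assumed layout.
const sample = [
  '#WARC filename offset warc-type warc-subject-uri warc-record-id content-type content-length',
  'example.warc 0 warcinfo None <urn:uuid:1> application/warc-fields 100',
  'example.warc 512 response http://example.org/ <urn:uuid:2> application/http;msgtype=response 2048',
].join('\n');

const index = parseIndex(sample);
console.log(index.get('http://example.org/').offset); // prints 512
```

A server could then pass the offset to fs.createReadStream(path, { start: offset }) to read the record from the uncompressed WARC. A URL that is missing from the Map has no archived record.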
- drupalib.interoperating.info.warc does not contain every file linked from the HTML; notably, the /themes/ directory is absent. Requests for these files return 404 errors.
- diagnose the problem that sometimes causes truncated HTML