
extract a random sample of HTML files from WARCs


warc-extractor

by Junqi Ma (jxm844@case.edu) and Tim Henderson (tim.tadh@gmail.com)

With "-n 30000", the extractor generates about 700 files larger than 300 KB each.

Example

./WarcExtractor -n 30000 --file crawl-file.warc.gz -o result-dir

TODO: add a command-line option to specify the HTML file size threshold
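
As a rough illustration of how a size threshold could combine with random sampling, here is a minimal sketch using reservoir sampling (Algorithm R). It is not taken from the WarcExtractor source: the class `SizeFilteredSampler`, the `Doc` type, and the `minBytes` parameter are hypothetical names, and reading "300kb" as 300 * 1024 bytes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; none of these names come from WarcExtractor itself.
class SizeFilteredSampler {

    /** Stand-in for one extracted HTML document: a name and a size in bytes. */
    static final class Doc {
        final String name;
        final long sizeBytes;

        Doc(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Assumed meaning of "300kb": 300 * 1024 bytes. */
    static final long DEFAULT_MIN_BYTES = 300L * 1024L;

    /**
     * Returns a uniform random sample of at most n documents whose size is
     * at least minBytes, reading the input once (reservoir sampling).
     */
    static List<Doc> sample(Iterable<Doc> docs, int n, long minBytes, Random rng) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        List<Doc> reservoir = new ArrayList<>();
        long seen = 0; // number of documents that passed the size filter so far
        for (Doc d : docs) {
            if (d.sizeBytes < minBytes) {
                continue; // too small, skip
            }
            seen++;
            if (reservoir.size() < n) {
                reservoir.add(d); // fill the reservoir first
            } else {
                // Pick j uniformly from [0, seen); replace slot j if it falls in the reservoir.
                long j = (long) (rng.nextDouble() * seen);
                if (j < n) {
                    reservoir.set((int) j, d);
                }
            }
        }
        return reservoir;
    }
}
```

A caller would pass the stream of extracted documents, the sample size, and the threshold, for example `SizeFilteredSampler.sample(docs, 700, SizeFilteredSampler.DEFAULT_MIN_BYTES, new Random())`. Reservoir sampling is chosen here because it needs only one pass and does not require knowing in advance how many documents pass the filter.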