/warc2text-runner

Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.

Primary Language: HTML
