hplt-project/warc2text-runner
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
HTML
Issues
Faster json parsing
#1 opened by ZJaume - 0
- 2
Inconsistent release
#12 opened by rggdmonk - 0
Move from setuptools to poetry ...
#8 opened by nvanva - 1
Incorrect langid model in release v2.0.0-alpha.3
#10 opened by rggdmonk - 4
Tag filters moved from warc2text
#6 opened by nvanva - 1
Change trafilatura TEI/XML to text behaviour
#2 opened by ZJaume