HTML web content extraction library that relies mostly on DOM features, along with some textual features. It achieves a tag-level F1-score of 0.96
on the Dragnet dataset.
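For readers unfamiliar with the metric, a tag-level F1-score can be read as treating every HTML tag as one binary decision (content or boilerplate) and taking the harmonic mean of precision and recall over those decisions. The snippet below is only a minimal sketch of that computation: the function name `tag_level_f1` and the labels are made up for illustration, and it is not the project's own evaluation code.

```python
def tag_level_f1(y_true, y_pred):
    """Compute F1 over per-tag binary labels (1 = content, 0 = boilerplate).

    Hypothetical helper for illustration; not part of this library's API.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content tags found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate marked as content
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content tags missed
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six tags of a single page.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = tag_level_f1(y_true, y_pred)
print(score)
```

Under this reading, a score of 0.96 means that both missed content tags and boilerplate tags mistaken for content are rare.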
First, install the dependencies. For the binary dependencies:
sudo apt-get install recode libxml2-dev libxslt1-dev unzip
Python dependencies:
pip install -r requirements.txt
Build the project and install it locally:
pip install -e .
Then download and prepare the data, passing the target directory and the number of workers:
./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>
Copyright (C) 2018 Nichita Uțiu