GoldMiner is a boilerplate removal algorithm.
It applies jusText to a webpage, but to only that part which contains the main content. This way comments, related stories etc won't be included.
GoldMiner learns the article positions of a webdomain, based on a sample (100 pages) from the given domain.
More details in the paper, at the end of this readme.
JusText is written by Jan Pomikalek, in python.
https://code.google.com/p/justext/
It has very good quality, so I have rewritten it in c++. :)
CleanPortalEval is an evaluate set for boilerplate removal algorithms. It is similar to previous ones (e.g. CleanEval), but it contains more pages from the same domain. (because GoldMiner needs more samples to learn a web domain)
It has two dependencies: pcrecpp and htmlcxx library. Last one is attached to project, it helps to parse the html.
Other one has to be linked.
GoldMiner repo contains 2 submodules (jusText + CleanPortalEval), so suggested git command is:
git clone --recursive https://github.com/endredy/GoldMiner
windows:
vcproj files might help. (goldMinerTester.vcproj depends on src/goldMiner.vcproj, jusText/jusText.vcproj, jusText/htmlcxx-0.84/htmlcxx.vcproj)
linux:
make
a simple test is included (GoldMinerTester.cpp), which cleans the dataset of CleanPortalEval.
The full test (compile, test and evaluation) can be done by evaluate.sh
If you use the tool, please cite the following paper:
@article{endredy_more_2013,
title = {More {Effective} {Boilerplate} {Removal} - the {GoldMiner} {Algorithm}},
issn = {1870-9044},
url = {http://polibits.gelbukh.com/2013_48},
language = {eng},
number = {48},
journal = {Polibits - Research journal on Computer science and computer engineering with applications},
author = {Endr{\'e}dy, Istv{\'a}n and Nov{\'a}k, Attila},
year = {2013},
keywords = {boilerplate removal, Corpus building, the web as corpus},
pages = {79--83}
}