CleanPortalEval
CleanPortalEval is a boilerplate removal test set for portals.
It is similar to the CleanEval test set, but it contains more pages from each domain. The motivation for the dataset is that some boilerplate removal algorithms (e.g. GoldMiner) need several sample pages from the same domain.
Its input and gold standard have the same format as CleanEval's, so the CleanEval evaluation script can be used on them as well.
It contains 70 pages from 4 domains.
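Because the gold standard follows the CleanEval format, an extractor's output can also be checked quickly without the evaluation script. The sketch below is not the CleanEval script: it is a simplified token-overlap score built on Python's difflib. It assumes that gold-standard lines may start with CleanEval-style markers such as <p>, <h>, or <l>. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: gold-standard lines may begin with CleanEval-style markers
# (<p>, <h>, <l>); they are stripped so only the text is compared.
MARKER = re.compile(r"^\s*<(?:p|h|l)>\s*", re.IGNORECASE)


def tokens(text):
    """Split text into whitespace tokens, dropping leading line markers."""
    words = []
    for line in text.splitlines():
        words.extend(MARKER.sub("", line).split())
    return words


def token_prf(extracted, gold):
    """Return (precision, recall, F1) based on in-order matching tokens."""
    ext, ref = tokens(extracted), tokens(gold)
    matcher = SequenceMatcher(None, ext, ref, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(ext) if ext else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = "<h>Title\n<p>Some article text here"
    extracted = "Menu Title Some article text here"
    print(token_prf(extracted, gold))
```

In this toy example the extractor kept one boilerplate token ("Menu"), so recall is perfect while precision drops below 1. For results comparable with published numbers, use the original CleanEval evaluation script.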
Reference
If you use this dataset, please cite the following paper:
@article{endredy_more_2013,
title = {More {Effective} {Boilerplate} {Removal} - the {GoldMiner} {Algorithm}},
issn = {1870-9044},
url = {http://polibits.gelbukh.com/2013_48},
language = {eng},
number = {48},
journal = {Polibits - Research journal on Computer science and computer engineering},
author = {Endr{\'e}dy, Istv{\'a}n and Nov{\'a}k, Attila},
year = {2013},
keywords = {boilerplate removal, Corpus building, the web as corpus},
pages = {79--83}
}