In order to help human assessors building a Web page segmentation (ground truth) we developped the tool MoB (Manual-design-Of-Blocks). It is designed as a browser extension and offers functionalities to expert users for creating a manual segmentations. Users can create blocks based on Web page elements. They can merge blocks, navigate into the element hierarchy to produce a block graph, or produce a flat segmentation (i.e. leaves in the block tree). These segmentations are stored in arepository for the evaluation.
Our segmenter BoM segments a Web page without having a priori knowledge of its content and using only the heuristic rules defined by the W3C Web standards. For instance, we detect blocks using HTML5 content categories instead of using the tag names or text features. The segmentation process is composed of two main phases : detecting fine-grained (as small as possible) blocks and then merging them according to a stop condition, so that the segmentation is performed at the desired granularity. It can be expressed by the following function Φ(W), where W is the rendered DOM of a Web page.
Download MOB from Here
For installing the extension follow the same process as the BOM extension installation instructions
Dataset: Web pages sources (in ScrapBook firefox extension format)