Scrapio

Scrapio is a suite of automatic web content extractors designed to detect, pull, and export structured and semi-structured web page content (e.g., news articles and tables) into convenient formats such as JSON and CSV.

This repo contains the source of several web data extraction algorithms. They were originally designed to be used in conjunction with each other, but each also works as a standalone page record miner.

Key Source File Directory

| File | Description |
| --- | --- |
| `tree_matcher.py` | An adaptation of Liu and Zhai's partial tree alignment algorithm [1], with several important modifications. Used to detect structured content whose HTML elements carry few distinguishing attributes. |
| `tree_traversal2.py` | Custom element record detection, based on tag pattern matching and grouping of elements with similar attributes. |
| `tree_merger.py` | Removes extraneous and irrelevant HTML page content, such as headers and footers. |
| `tree_converter_handlers3.py` | Converts the large sets of similar records produced by the preceding files into two-dimensional `JSON` objects for later export. |
| `collection_segmentation.py` | Groups individual HTML elements into collections of related page data, reflecting the visual content layout of the source webpage. |
| `collection_segmentation.js` | JavaScript implementation of `collection_segmentation.py` for use in a Chrome extension. |
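
To give a feel for the kind of tree comparison that partial tree alignment [1] builds on, here is a minimal sketch of the textbook Simple Tree Matching recurrence. It is not the code in `tree_matcher.py` and does not include the repository's modifications; the `Node` class, the function names, and the normalization used in `similarity` are hypothetical choices made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node: a tag name plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving match."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i children of a and first j of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def similarity(a: Node, b: Node) -> float:
    """Match score divided by the average tree size (one common normalization)."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two table rows that differ only by one extra cell.
row1 = Node("tr", [Node("td"), Node("td", [Node("a")])])
row2 = Node("tr", [Node("td"), Node("td", [Node("a")]), Node("td")])
print(simple_tree_matching(row1, row2))  # 4
```

Subtrees that score close to 1.0 under `similarity` are candidates for being records of the same kind; any threshold for "close" would be a tuning choice and is not taken from this repository.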

References

[1] Liu, B., & Zhai, Y. (2005, May 10-14). Web data extraction based on partial tree alignment. Proceedings of the 14th International Conference on World Wide Web (WWW 2005), Chiba, Japan. https://dl.acm.org/doi/10.1145/1060745.1060761
