MDR is a library detect and extract listing data from HTML page. It implemented base on the Finding and Extracting Data Records from Web Pages but change the similarity to tree alignment proposed by Web Data Extraction Based on Partial Tree Alignment and Automatic Wrapper Adaptation by Tree Edit Distance Matching.
numpy
and scipy
must be installed to build this package.
MDR assume the data record close to the elements has most text nodes:
[1]: import requests [2]: from mdr import MDR [3]: mdr = MDR() [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london') [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8')) ... [8]: [doc.getpath(c) for c in candidates[:10]] ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul', '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]', '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]', '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div', '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]', '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul', '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]', '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody', '/html/body/div[2]', '/html/body/div[2]/div[4]/div/div[1]']
MDR can find the repetiton patterns by using tree matching under certain candidate DOM tree, then it builds a mapping from HTML element to other matched elements of the DOM tree.
You can annotate the seed elements with any tools (e.g. scrapely) you like, then mdr will be able to find the other matched elements on the page.
e.g. you can find this demo page here. the colored data in first row are annotated manually, the rest are extracted by MDR.
Terry Peng <pengtaoo@gmail.com>
MIT