Apifier

Apifier is a very simple HTML parser written in Python.

It aims to parse HTML document in a declarative way using css selectors. Its main purpose is to parse tabular and/or paginated data.

Install

Apifier is available for python 3

pip install apifier

Example

Getting all comments from an article at "LeFigaro.fr"

from apifier import Apifier

config = {
    "name": "FigaroBot article comments",
    "encoding": "latin-1",
    "url": "http://www.lefigaro.fr/politique/le-scan/2016/07/21/25001-20160721ARTFIG00062-attentat-de-nice-la-droite-demande-une-enquete-independante.php",
    "foreach": "#fig-pagination-nav > li > a",
    "context": "page",
    "prefix": ""#reagir > div > div > div.fig-col.fig-col--comments > div:nth-child(3) > ul > li > article >",
    "description": {
        "author": "div.fig-comment-header a",
        "comment": "div.fig-comment-msg p"
    }
}

api = Apifier(config=config)
data = api.load()

Config

name : name of the current configuration
encoding : is the encoding the page is using, data will be converted from this encoding to utf-8 for sanity
url : page url, first page in case of paginated data

foreach : css selector for the pagination links int this example pagination looks like :

<ul id="fig-pagination-nav">
  <li class="fig-pagination-current"><a href="…"> 1 </a></li>
  <li><a href="…"> 2 </a></li>
  <li><a href="…"> 3 </a></li>
</ul>

context : each data will be associated with a special variable named after the content of the pagination link in this case, this content is just the page number, but the pagination mechanism can be used for othher purpose like categories
prefix : descriptors will be prefixed by this option
description : descriptor for content to parse, in this example, comment content and author name.

The result looks like this :

    data =
    [
        {'comment': "…", 'author': '…', 'page': '1'}, etc
    ]

LucasBerbesson/apifier

Apifier

Install

Example

Config