/fass

This simple server enables scraping of website with dynamic content.

Primary LanguagePythonMIT LicenseMIT

FASS - FastAPI - Selenium - Scraper

Latest Tag Build Status License Docker Pulls Docker Image Size

This simple server enables scraping of website with dynamic content. It exposes the parser via rest API: http://localhost:8000/parse and accepts POST in the form of, e.g.

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'

the payload contains a list of websites to scrape, each containing the url a name, delay in seconds and patterns. The two first kwargs are self explenatory, the delay parameters defines how many seconds the selenium driver should wait until the page is scraped. The pattern represent a list of entities to extract from the page, defined by Python regex expression and a name which will be used in the returned json.

The example above return:

{
    "name":"Github stars",
    "all_fields_matched":true,
    "Star Counter":["154"]
}

Please note that the matched values are always a list since we match all occurences on page. If multiple Python regex groups are defined, the returned list will contain tuples.

Installation

From source

Clone this repo and

docker build -t fass_app .

From Docker hub

docker pull zerealfu/fass:latest

Running the service

docker run -d -p 8000:8000 fass_app

then execute the curl for example:

curl -X POST "http://localhost:8000/parse/" -H "Content-Type: application/json" -d '[
            {
                "url": "https://github.com/pymzml/pymzML/",
                "name": "Github stars",
                "delay": "1",
                "patterns": [
                    {
                        "name": "Star Counter",
                        "regex": "Counter js-social-count\\\">(?P<Stars>[0-9]*)</span>"
                    }
                ]
            }
        ]'

Have fun :)