
Site Scraper for different document types, easy to use and easy to configure.



We strongly encourage using the provided python3 virtual environment setup. In general, it can be created and activated as follows:

python3 -m venv $(pwd)/.venv/scraper

source .venv/scraper/bin/activate

pip3 install -r requirements.txt

This script uses Selenium WebDriver for its crawling capabilities. Two drivers are supported, Chrome and Firefox; Chrome is preferred due to a bug in Firefox. At least one of the two browsers must be installed, and its binary must be on $PATH. The corresponding driver must also be installed: chromedriver for Chrome or geckodriver for Firefox.
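The requirement above can be checked from Python before running the scraper. The following is a minimal sketch, not part of the script itself: the helper name find_available_drivers is hypothetical, and the browser executable names are assumptions that vary by platform.

```python
import shutil

# Driver executables named in this README, mapped to candidate browser
# executable names. The browser names are assumptions; adjust per platform.
DRIVERS = {
    "chromedriver": ("google-chrome", "chromium", "chrome"),
    "geckodriver": ("firefox",),
}


def find_available_drivers(which=shutil.which):
    """Return the driver names whose driver and browser are both on PATH."""
    available = []
    for driver, browsers in DRIVERS.items():
        if which(driver) is None:
            continue  # driver executable not found on PATH
        if any(which(browser) is not None for browser in browsers):
            available.append(driver)
    return available


if __name__ == "__main__":
    found = find_available_drivers()
    if found:
        print("Usable drivers:", ", ".join(found))
    else:
        print("No usable browser/driver pair found on PATH")
```

The which parameter exists only so the lookup can be replaced in tests; by default it uses shutil.which, which searches $PATH.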


Both drivers are available through Homebrew and can be installed as follows:

brew install chromedriver
brew install geckodriver

Linux-like OS


This has not been tested yet; please feel free to contribute a PR.


The script uses a YAML configuration file. Its grammar is as follows:

  • sites - list, one entry per site to scrape
  • string - key of each list entry, used as the identifier for that site
  • url - string, the link of the site to be scraped
  • limiter - regex, the domain within which the scraper will run; in general it matches the url
  • path - string, name of the folder under ${cwd}/documents where the documents will be downloaded
  • documents - list of strings, the extensions of the documents to download
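To make the role of limiter and documents concrete, here is a minimal sketch of how a crawler might apply them to a discovered link. The function names in_scope and is_wanted_document are hypothetical, and matching the limiter with re.search against the host name is an assumption about the intended semantics, not a description of the script's actual implementation.

```python
import re
from urllib.parse import urlparse


def in_scope(link, limiter):
    """True if the link's host matches the limiter regex (assumed semantics)."""
    host = urlparse(link).netloc
    return re.search(limiter, host) is not None


def is_wanted_document(link, documents):
    """True if the link's path ends with one of the configured extensions."""
    path = urlparse(link).path.lower()
    return any(path.endswith("." + ext.lower()) for ext in documents)


if __name__ == "__main__":
    link = "https://some.site.com/files/report.pdf"
    print(in_scope(link, "site.com"))
    print(is_wanted_document(link, ["pdf", "docx", "doc"]))
```

Because limiter is a regex, an unescaped dot matches any character; writing it as site\.com is stricter if that matters for a given site.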


sites:
  - my.site.name:
      url: https://some.site.com
      limiter: site.com
      path: site.com
      documents:
        - pdf
        - docx
        - doc
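
A configuration following this grammar can be checked before a run. The sketch below assumes the file has already been parsed into Python objects (for instance with PyYAML's yaml.safe_load) and that it has a top-level sites list with url, limiter, path and documents under each identifier. The function validate_config and the constant REQUIRED_KEYS are hypothetical names, not part of the script.

```python
# Keys each site entry is expected to define, per the grammar above.
REQUIRED_KEYS = {"url", "limiter", "path", "documents"}


def validate_config(config):
    """Return a list of problems found; an empty list means none were found."""
    sites = config.get("sites")
    if not isinstance(sites, list):
        return ["'sites' must be a list"]
    problems = []
    for entry in sites:
        # Each list element is expected to be {identifier: settings}.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"unexpected site entry: {entry!r}")
            continue
        name, settings = next(iter(entry.items()))
        if not isinstance(settings, dict):
            problems.append(f"{name}: settings must be a mapping")
            continue
        missing = REQUIRED_KEYS - set(settings)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        elif not isinstance(settings["documents"], list):
            problems.append(f"{name}: 'documents' must be a list")
    return problems


# The example configuration, as it is assumed to look after parsing.
example = {
    "sites": [
        {
            "my.site.name": {
                "url": "https://some.site.com",
                "limiter": "site.com",
                "path": "site.com",
                "documents": ["pdf", "docx", "doc"],
            }
        }
    ]
}

if __name__ == "__main__":
    print(validate_config(example))
```

Returning a list of problems instead of raising on the first one lets every misconfigured site be reported in a single pass.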