Crawler System

Airflow + Celery + Docker + AWS S3 crawler system

How to use

The system has already been deployed; please refer to the API docs below.

System Architecture


API docs

set_urls (set the URLs you want to scrape)

  • URL /set_urls
  • Method POST
  • URL Params None
  • Data Params (application/json) {"url": ["", ""]}
  • Success Response 'status: OK'
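
As a rough illustration of the parameters above, the sketch below builds a POST request for /set_urls using only the Python standard library. The base URL assumes the local deployment described later in this README, and the helper name build_set_urls_request is hypothetical; neither comes from the project's own client code.

```python
import json
import urllib.request

# Assumption: the API is reachable at the local address used in the setup section.
BASE_URL = "http://localhost:5000"


def build_set_urls_request(urls, base_url=BASE_URL):
    """Build (but do not send) a POST /set_urls request for the given URLs."""
    # The documented payload is a JSON object with a "url" list.
    body = json.dumps({"url": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/set_urls",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned request can then be sent with urllib.request.urlopen(req); the response body should contain the success status listed above.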

retrieve_by_url (get the scraping result for a single URL)

  • URL /retrieve_by_url
  • Method GET
  • URL Params url
  • Data Params None
  • Success Response {"status": "SUCCESS", "result": {"time": "Fri Jul 31 11:30:07 2020", "context": "html context...", "metadata": {"url": "", "domain": "", "title": "Google"}}, "traceback": null, "children": [], "date_done": "2020-07-31T11:30:07.159309", "task_id": ""}
  • Example
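
A minimal sketch of how a client might call this endpoint and read the response, assuming the local base URL from the setup section. The helper names retrieve_by_url_endpoint and extract_page are hypothetical, and the field names are taken from the success response shown above.

```python
import json
import urllib.parse

# Assumption: the API is reachable at the local address used in the setup section.
BASE_URL = "http://localhost:5000"


def retrieve_by_url_endpoint(target_url, base_url=BASE_URL):
    """Return the GET address for /retrieve_by_url with the url query parameter."""
    # urlencode percent-escapes the target so its own "?" and "&" survive intact.
    query = urllib.parse.urlencode({"url": target_url})
    return f"{base_url}/retrieve_by_url?{query}"


def extract_page(response_text):
    """Pull the scraped page out of a response body, or None if not successful."""
    payload = json.loads(response_text)
    if payload.get("status") != "SUCCESS":
        return None
    result = payload["result"]
    return {
        "title": result["metadata"]["title"],
        "time": result["time"],
        "html": result["context"],
    }
```

Escaping the target URL matters here because it is itself passed as a query parameter value.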

retrieve_by_urls (get the scraping results for multiple URLs)

  • URL /retrieve_by_urls
  • Method POST
  • URL Params None
  • Data Params (application/json) {"url": ["", ""]}
  • Success Response {"": {retrieve_by_url result}, "": {retrieve_by_url result}}
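
The batch endpoint can be sketched the same way. The example below assumes that the keys of the response object are the requested URLs and that unfinished tasks report a status other than "SUCCESS" (for instance Celery's "PENDING" state); both are assumptions, as are the helper names.

```python
import json
import urllib.request

# Assumption: the API is reachable at the local address used in the setup section.
BASE_URL = "http://localhost:5000"


def build_retrieve_by_urls_request(urls, base_url=BASE_URL):
    """Build (but do not send) a POST /retrieve_by_urls request."""
    body = json.dumps({"url": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/retrieve_by_urls",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_by_status(response_text):
    """Separate finished results from the keys that are not finished yet."""
    results = json.loads(response_text)
    # Assumption: each value has the same shape as a retrieve_by_url result.
    done = {key: res for key, res in results.items() if res.get("status") == "SUCCESS"}
    not_done = sorted(key for key in results if key not in done)
    return done, not_done
```

Splitting the response this way lets a caller retry only the entries that have not finished.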

retrieve_by_domain (get the scraping results for a single domain)

Local setup

  1. Rename the environment variable file
$ mv .env.default .env
  2. Add your own AWS access key ID and secret access key to .env (create the S3 bucket and directory if necessary)
S3_ACCESS_KEY_ID=Your AWS access key ID
S3_SECRET_ACCESS_KEY=Your AWS secret access key
  3. Build the Docker image and run it
$ docker-compose up
  4. Access the Airflow panel at http://localhost:8080/ and the API interface at http://localhost:5000/
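
Once the containers are running, a quick reachability check can confirm that both services answer. This is only a sketch: the is_up helper and its opener parameter are hypothetical, and it treats any HTTP response (including an error status) as a sign that the server is listening.

```python
import urllib.error
import urllib.request

# The two local addresses listed in the setup steps.
SERVICES = {
    "airflow": "http://localhost:8080/",
    "api": "http://localhost:5000/",
}


def is_up(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if something answers at url, False if the connection fails."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, even though the status was an error code.
        return True
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError.
        return False


def check_services(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to whether it is currently reachable."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}
```

Calling check_services() after docker-compose up returns a dictionary with one boolean per service.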