Automated workflow for data scraping

Process

  • Fetch data from the source URL
  • Clean the data in a Pandas dataframe
  • Convert the dataframe to .csv
  • Store the .csv in an S3 bucket as a text file (see the sketch below)
  • Deploy the scraper to AWS Lambda and run it there every 3 minutes
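
The steps above roughly translate into a pipeline like the sketch below. The source URL, bucket name, object key, and cleaning logic are placeholders; each scraper supplies its own.

    import boto3
    import pandas as pd
    import requests

    # Placeholder source URL and bucket name; real scrapers define their own
    SOURCE_URL = "https://example.com/data.json"
    BUCKET_NAME = "my-scraper-bucket"


    def run() -> None:
        # Get data from the URL
        response = requests.get(SOURCE_URL, timeout=30)
        response.raise_for_status()

        # Clean the data in a Pandas dataframe (the real cleaning is scraper-specific)
        df = pd.DataFrame(response.json())
        df = df.dropna()

        # Convert the dataframe to .csv and store it in the S3 bucket as a text file
        csv_body = df.to_csv(index=False)
        boto3.client("s3").put_object(
            Bucket=BUCKET_NAME,
            Key="scrapername/latest.csv",
            Body=csv_body.encode("utf-8"),
        )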

Creating a new scraper

  • Create a copy of scraper_template.py or of one of the existing scrapers
  • Implement your scraper
  • Add your scraper to the list of scrapers in handler.py (see the sketch below)
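
A new scraper could look roughly like the sketch below; the module name get_data_myscraper.py, the scrape() function, and the SCRAPERS list are illustrative assumptions, so follow scraper_template.py and handler.py for the actual structure.

    # get_data_myscraper.py -- hypothetical new scraper based on scraper_template.py
    import pandas as pd
    import requests


    def scrape() -> pd.DataFrame:
        # Fetch and clean the data for this source, then return the dataframe
        response = requests.get("https://example.com/source", timeout=30)
        response.raise_for_status()
        return pd.DataFrame(response.json())

Registering it in handler.py might then be a one-line change (again, the list name is an assumption):

    # handler.py
    import get_data_myscraper

    SCRAPERS = [
        # ...existing scrapers...
        get_data_myscraper,
    ]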

How to test locally

  • Run your scraper script directly: python get_data_scrapername.py
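
If the scraper module has a small entry point at the bottom, running it locally just prints the result. This pattern mirrors the hypothetical scrape() sketch above and is an assumption, not necessarily the template's exact layout.

    # At the bottom of get_data_scrapername.py
    if __name__ == "__main__":
        df = scrape()
        print(df.head())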

Deploy

  • Push your code to the repo; it will be deployed automatically

Contributing

  • Open a pull request (or ask for repo membership)
  • After review, your new scraper will be deployed automatically

Error reporting

  • Any errors in a scraper are reported to Sentry
  • Use asserts to confirm that the data is valid
  • If non-critical but interesting changes to the data are noticed, report them to Slack via a webhook (see the sketch below)
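
A rough sketch of the validation and Slack reporting described above. The webhook URL, the expected column, and the helper names are placeholders; Sentry reporting is assumed to be configured at the Lambda level, where it picks up unhandled exceptions such as failed asserts.

    import os

    import pandas as pd
    import requests

    # Placeholder webhook URL; in practice it would come from configuration
    SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")


    def validate(df: pd.DataFrame) -> None:
        # Asserts confirm the data is valid; a failed assert raises AssertionError,
        # which surfaces as an error in Sentry
        assert not df.empty, "scraper returned no rows"
        assert "date" in df.columns, "expected a 'date' column"


    def report_to_slack(message: str) -> None:
        # Non-critical but interesting changes are posted to Slack via the webhook
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)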