/ucpd-incident-scraper

This code scrapes the UCPD Daily Incident webpage every day and stores the incidents in a generic JSON data store via a Python 🐍 job on GCP's Cloud Run.

Primary language: Python. License: MIT.

UChicago Incident Page Scraper

This repository houses a scraping engine for the UCPD's Incident Report webpage. The data is stored in Google Cloud Platform's Datastore, and the scraper is run using Heroku's Dyno functionality.

Primary Application Functions

  1. Scrape the UCPD Incident Report webpage every weekday morning, pulling all incidents from the latest reported incident date in the Google Datastore to the current day.
  2. Upload all stored UCPD incidents to the Chicago Maroon's Google Drive every Saturday morning.
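
The schedule described above could be sketched roughly as follows. This is only an illustration and not the repository's actual scheduler: the function name `jobs_for` and the job labels `"scrape"` and `"upload"` are hypothetical, and the time of day is left out entirely.

```python
from datetime import date


def jobs_for(day: date) -> list[str]:
    """Return the (hypothetical) job labels that would run on the given day."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday < 5:
        # Monday through Friday: scrape the incident webpage.
        return ["scrape"]
    if weekday == 5:
        # Saturday: upload the stored incidents to Google Drive.
        return ["upload"]
    # Sunday: nothing is scheduled in this sketch.
    return []
```

Keeping the decision in a small pure function like this makes the schedule easy to check without touching the scraper or any cloud service.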

Relevant Reading

  • Ethical Issues of Crime Mapping: Link

Acknowledgements

I'd like to thank @kdumais111 and @FedericoDM for their incredible help in getting the scraping architecture in place, as well as @ehabich for adding a bit of testing validation to the project. Thanks, y'all! <3

Project Requirements

  • Python version: ^3.11
  • Poetry

Required Credentials

  • Census API Key stored in the environment variable: CENSUS_API_KEY
  • Google Cloud Platform service account with location of the service_account.json file stored in the environment variable: GOOGLE_APPLICATION_CREDENTIALS
  • Google Cloud Platform project ID stored in the environment variable: GOOGLE_CLOUD_PROJECT
  • Google Maps API key stored in the environment variable: GOOGLE_MAPS_API_KEY
  • Google Drive Folder ID stored in the environment variable: GOOGLE_DRIVE_FOLDER_ID
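
Before running any command, it may help to confirm that all five environment variables listed above are set. The snippet below is a minimal sketch of such a check. The helper name `missing_credentials` is hypothetical and not part of this repository, and it only checks that each variable is non-empty, not that the key or file it points to is valid.

```python
import os

# Environment variable names taken from the list above.
REQUIRED_ENV_VARS = (
    "CENSUS_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_MAPS_API_KEY",
    "GOOGLE_DRIVE_FOLDER_ID",
)


def missing_credentials(environ=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing credentials: " + ", ".join(missing))
    print("All required credentials are set.")
```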

Technical Notes

  • Any modules should be added via the poetry add [module] command.
    • Example: poetry add black

Standard Commands

  • make lint: Runs pre-commit on the codebase.
  • make seed: Save incidents starting from January 1st of 2011 and continuing until today.
  • make update: Save incidents starting from the most recently saved incident until today.
  • make build-model: Build a predictive XGBoost model based off of locally saved incident data and save it in the data folder.
  • make categorize: Categorize stored incidents labeled 'Information' using the locally saved predictive model.
  • make download: Download all incidents into a locally stored file titled incident_dump.csv.
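
The difference between make seed and make update comes down to where the date range starts. The sketch below illustrates that idea only. The function `scrape_window` and its arguments are hypothetical, and the repository's real implementation may handle the range differently (for example, whether the most recently saved date is itself re-scraped is an assumption here).

```python
from datetime import date
from typing import Optional

# make seed starts from January 1st of 2011, per the command list above.
SEED_START = date(2011, 1, 1)


def scrape_window(
    command: str, today: date, most_recent_saved: Optional[date] = None
) -> tuple[date, date]:
    """Return the (start, end) dates a command would cover in this sketch."""
    if command == "seed":
        return SEED_START, today
    if command == "update":
        if most_recent_saved is None:
            raise ValueError("update needs the most recently saved incident date")
        # Assumption: the most recently saved date is included in the range.
        return most_recent_saved, today
    raise ValueError(f"unknown command: {command!r}")
```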