👮📊
A project to track arrest statistics for the Honolulu Police Department
View Project
This project provides a dashboard that tracks, and updates from, the HPD's published daily arrest log reports.
The Attorney General's Office provides annual reports on the state of crime in Hawaii. This project provides a mechanism to validate these reports, track the numbers daily, and keep an archive of the raw data.
Using a combination of image cropping and OCR, we extract data about each arrest from each daily published arrest log.
Every day (with cron!), the script is run (`cd scrape && python3 main.py`) to scrape and parse the newly published arrest log. It then does the following:
- Uploads the PDF file to AWS S3 for archiving
- Downloads the PDF file locally for parsing purposes
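The archiving step could be sketched as follows. This is a minimal sketch, not the project's actual code: the key layout, the `arrest-logs` prefix, and the function names `archive_key` and `archive_pdf` are assumptions made for illustration. The S3 client is passed in, so the only real API relied on is boto3's `upload_file(Filename, Bucket, Key)`.

```python
from datetime import date
from pathlib import Path


def archive_key(log_date: date, prefix: str = "arrest-logs") -> str:
    """Build a date-partitioned S3 key (hypothetical layout)."""
    # Example shape: arrest-logs/2021/03/2021-03-05.pdf
    return f"{prefix}/{log_date:%Y/%m}/{log_date.isoformat()}.pdf"


def archive_pdf(s3_client, pdf_path: Path, bucket: str, log_date: date) -> str:
    """Upload one arrest log PDF and return the key it was stored under."""
    key = archive_key(log_date)
    # boto3's S3 client exposes upload_file(Filename, Bucket, Key).
    s3_client.upload_file(str(pdf_path), bucket, key)
    return key
```

With boto3 installed, a call might look like `archive_pdf(boto3.client("s3"), Path("log.pdf"), "my-bucket", date.today())`, where the bucket name is a placeholder. Partitioning keys by year and month keeps the archive browsable as it grows.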
After we download the file, we prepare it for image cropping and OCR. To do this, we:
- Split the PDF into individual pages (Example Page PDF)
- Convert all the PDF file's pages into images (Example Page Image)
- Vertically concatenate all the page images into one long image, cropping out the top and bottom of each page so that only arrest records remain (Example Vertically Concatenated Image)
- Crop each individual arrest record using the location of pixels (Example Record Image)
- Crop each portion of the arrest record by the categories we want to parse:
- Use OCR (PyTesseract) to parse the text
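One way to locate individual records by pixel position is sketched below. It assumes, purely for illustration, that records in the concatenated image are separated by dark horizontal rules; the README does not say which pixel cue the project actually uses. The image is represented as plain rows of grayscale values so the sketch runs without any imaging library, and `is_separator` and `record_boxes` are hypothetical names.

```python
from typing import List, Tuple

Row = List[int]  # one row of grayscale pixels: 0 = black, 255 = white


def is_separator(row: Row, dark_threshold: int = 128, min_fraction: float = 0.9) -> bool:
    """Treat a row as a separator when most of its pixels are dark."""
    if not row:
        return False
    dark = sum(1 for px in row if px < dark_threshold)
    return dark / len(row) >= min_fraction


def record_boxes(rows: List[Row], min_height: int = 2) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges between separators; bottom is exclusive."""
    boxes: List[Tuple[int, int]] = []
    top = None  # first row of the record currently being scanned, if any
    for y, row in enumerate(rows):
        if is_separator(row):
            # A separator closes the current record, if it is tall enough.
            if top is not None and y - top >= min_height:
                boxes.append((top, y))
            top = None
        elif top is None:
            top = y
    # Close a record that runs to the bottom edge of the image.
    if top is not None and len(rows) - top >= min_height:
        boxes.append((top, len(rows)))
    return boxes
```

With Pillow, each `(top, bottom)` pair would map to `image.crop((0, top, image.width, bottom))`, and the resulting crop could be passed to `pytesseract.image_to_string`. The `min_height` guard discards slivers too short to hold a record.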
We then upload the data to AWS DynamoDB. Using Flask and boto3 (the AWS SDK for Python), the data is served from DynamoDB to the HPDStats website. An example of the artifacts generated by the script can be viewed here: Example Artifacts
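The upload to DynamoDB might look roughly like the sketch below. The attribute names `report_date` and `record_id`, and the helpers `to_item` and `save_records`, are assumptions for illustration; the project's real schema is not described here. The table object is passed in, so the only boto3 call relied on is the Table resource's `put_item(Item=...)`.

```python
from typing import Dict, List


def to_item(report_date: str, record_index: int, fields: Dict[str, str]) -> Dict[str, str]:
    """Shape one record's OCR output into a flat item (hypothetical schema)."""
    item = {
        "report_date": report_date,
        "record_id": f"{report_date}#{record_index:04d}",
    }
    for name, text in fields.items():
        # OCR output often carries stray newlines and repeated spaces.
        cleaned = " ".join(text.split())
        if cleaned:  # skip fields where OCR found nothing
            item[name] = cleaned
    return item


def save_records(table, report_date: str, records: List[Dict[str, str]]) -> int:
    """Write every parsed record to the table and return how many were written."""
    for index, fields in enumerate(records):
        # A boto3 DynamoDB Table resource exposes put_item(Item=...).
        table.put_item(Item=to_item(report_date, index, fields))
    return len(records)
```

With boto3 installed, `table` would come from `boto3.resource("dynamodb").Table(...)` with the project's table name. Dropping empty fields keeps blank OCR results out of the stored items.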