
Code relating to scraping.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Police Data Accessibility Project Scrapers

This repo contains the record scrapers (and associated tooling) to further the goals of the Police Data Accessibility Project. Thank you for your interest in contributing!

Getting Started

Quick start

  1. Clone this repo.
  2. Make a copy of the template folder in the appropriate jurisdiction folder. Read more about structure below.
  3. Code your scraper.
  4. Scrape sample data from the source and add a truncated version to the folder so we understand the kind of data your scraper generates.
  5. Complete the readme to the best of your ability.
  6. If you know how to use Splunk, complete the config file.
  7. Submit a Pull Request for approval.
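Step 4 asks for a truncated sample of the scraped data. As a minimal sketch, assuming your scraper writes CSV (the repo does not mandate an output format), the hypothetical helper below copies the header and the first few rows into a smaller file. The function name, the file names, and the default of 25 rows are illustrative choices, not project requirements.

```python
import csv
from itertools import islice


def truncate_csv(src_path, dest_path, max_rows=25):
    """Copy the header plus at most `max_rows` data rows from src to dest.

    Returns the number of data rows written (the header is not counted).
    Hypothetical helper: the name and default row count are assumptions.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dest_path, "w", newline="", encoding="utf-8") as dest:
        reader = csv.reader(src)
        writer = csv.writer(dest)

        header = next(reader, None)
        if header is None:
            # Empty source file: nothing to copy.
            return 0
        writer.writerow(header)

        count = 0
        # islice stops after max_rows without reading the rest of the file.
        for row in islice(reader, max_rows):
            writer.writerow(row)
            count += 1
        return count
```

For example, `truncate_csv("full_output.csv", "sample.csv", max_rows=10)` would leave a ten-row sample next to your scraper; both file names here are placeholders.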

Structure

Stick to the format of USA/$STATE/$COUNTY/$RECORD_TYPE. If there are state-level records being scraped, use USA/$STATE/_State/$RECORD_TYPE. Use underscores rather than spaces or dashes.

Legal

Only scrapers that comply with our legal guidelines will be merged into this repo.

General Guidelines

Python is preferred. If you use another language, please document your work.

Your scraper must comply with our legal guidelines.

Everyone working on this project is doing so in their free time. Please expect some back-and-forth with the people reviewing your PRs, and be patient and respectful with us. The more you test and validate that your scraper meets the contribution guidelines, the quicker we can accept it.

Getting Help

The #scrapers_general Slack channel is the place to start.

Known datasets

This dataset catalogue is how we track potential sources.

Fields to scrape

Note: the naming convention for these fields may not be consistent across data sources. If a field is not retrievable, fill it with "NA".

  • _id
  • _state
  • _county
  • CaseNum
  • FirstName
  • MiddleName
  • LastName
  • Suffix
  • DOB
  • Race
  • Sex
  • ArrestDate
  • FilingDate
  • OffenseDate
  • DivisionName
  • CaseStatus
  • DefenseAttorney
  • PublicDefender
  • Judge
  • ChargeCount
  • ChargeStatute
  • ChargeDescription
  • ChargeDisposition
  • ChargeDispositionDate
  • ChargeOffenseDate
  • ChargeCitationNum
  • ChargePlea
  • ChargePleaDate
  • ArrestingOfficer
  • ArrestingOfficerBadgeNumber
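One way to apply the "NA" rule is to pass every scraped record through a normalizer before writing it out. The sketch below is an assumption-laden example, not the project's required implementation: `normalize_record` is a hypothetical helper, treating empty strings as missing is my own choice, and keys that are not in the list above are simply dropped.

```python
# Field names copied from the list above, in the same order.
FIELDS = [
    "_id", "_state", "_county", "CaseNum",
    "FirstName", "MiddleName", "LastName", "Suffix",
    "DOB", "Race", "Sex",
    "ArrestDate", "FilingDate", "OffenseDate",
    "DivisionName", "CaseStatus",
    "DefenseAttorney", "PublicDefender", "Judge",
    "ChargeCount", "ChargeStatute", "ChargeDescription",
    "ChargeDisposition", "ChargeDispositionDate", "ChargeOffenseDate",
    "ChargeCitationNum", "ChargePlea", "ChargePleaDate",
    "ArrestingOfficer", "ArrestingOfficerBadgeNumber",
]


def normalize_record(raw: dict) -> dict:
    """Return a dict with every expected field, using "NA" for gaps.

    Assumptions: None and blank strings count as "not retrievable",
    and keys outside FIELDS are ignored.
    """
    record = {}
    for field in FIELDS:
        value = raw.get(field)
        if value is None or str(value).strip() == "":
            record[field] = "NA"
        else:
            record[field] = value
    return record


# Made-up values for illustration; real sources will name fields differently.
scraped = {"CaseNum": "2020-CF-0001", "LastName": "Doe", "Judge": ""}
row = normalize_record(scraped)
print(row["CaseNum"], row["Judge"], row["DOB"])
```

Because source sites rarely use these exact names, a scraper would typically map the source's own column names onto these keys before calling the normalizer.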