New York is the most populous city in the United States of America (USA). It has a population of more than 8.5 million, and it keeps growing. Being a densely populated city has a direct impact on the crime rate. However, crime in the city is unevenly distributed, depending on various factors. Once known as the "murder capital", NYC has seen its crime rates, especially murder rates, decrease over the last two decades. However, crime has not stopped, so it is important to know when, where, and what crimes are taking place, and especially what kinds of crimes happen where. Thus, we're going to embark on an exploratory data analysis of crime complaints in NYC and make predictions about crimes in Brooklyn.
New York City has 5 administrative regions called boroughs: The Bronx, Manhattan, Brooklyn, Queens, and Staten Island.
Objectives -
- To understand the crime rate in New York from 2013 to 2019.
- What kinds of crimes are most prevalent in New York?
- Are there certain times when crime is more likely to occur? If so, when?
- Find out whether the crime rate differs significantly between the regions of New York.
- Try to find associations / patterns between crime and another important criterion in New York.
Data Sources -
The data for this analysis is taken from two sources:
- NYPD Complaint Data Historic: This dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of 2017.
- NYPD Complaint Data Year To Date: This dataset includes all valid felony, misdemeanor, and violation crimes reported to the NYPD for all complete quarters so far from 2018 to 2019.
Both datasets are published by the NYPD and are publicly available via the Socrata Open Data API. More about the Socrata Open Data API can be found here
Note on limitations -
- Historic data has been collected for 2013 to 2017. For each year, six months of data (January to June) has been collected.
- Year To Date data has been collected for 2018 to 2019. For each year, six months of data (January to June) has been collected.
- To get the data, the dates in the scraping utility have to be changed manually.
- setDefaults(): acts as a setter function that initialises global variables which can be referenced throughout the notebook
- isValid(): validation checks to verify that the request parameters and other globals are passed correctly to the API
- isValidAndSetDefaults(): helper function for setDefaults() and isValid()
- getData(): helper function to support the operations below
  - buildURL(): constructs the URL from the base URL and global vars
  - callAPI(): helper function that performs the data fetch operation
  - getDataByChunks(): main function that builds the data by fetching it in chunks and dumping it into the csv files
- mergeCSV(): merges the resultant csv files obtained from getData() into a single resultant csv file
- preprocessData(): preprocesses data and returns two df, for historic and yearToDate respectively
  - preprocessHistoric(): preprocesses data and returns a df for historic data
  - preprocessYearToDate(): preprocesses data and returns a df for yearToDate data
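The chunked fetch described in the list above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the snake_case function names, the placeholder `BASE_URL`, the `cmplnt_fr_dt` column name, the date string format, and the chunk size are all assumptions made for illustration. It relies only on the SoQL paging parameters `$limit`, `$offset`, `$order`, and `$where`.

```python
import csv
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder: substitute the real Socrata resource URL of the dataset.
BASE_URL = "https://data.cityofnewyork.us/resource/<dataset-id>.json"


def build_url(base_url, start_date, end_date, limit, offset, date_column="cmplnt_fr_dt"):
    """Build the query URL for one page of rows with start_date <= date < end_date.

    The column name `cmplnt_fr_dt` is an assumption; check the dataset's schema.
    """
    params = {
        "$where": f"{date_column} >= '{start_date}' AND {date_column} < '{end_date}'",
        "$order": ":id",  # a stable ordering keeps pages from overlapping
        "$limit": limit,
        "$offset": offset,
    }
    return f"{base_url}?{urlencode(params)}"


def call_api(url):
    """Fetch one page and decode the JSON array of row dictionaries."""
    with urlopen(url) as response:
        return json.load(response)


def get_data_by_chunks(base_url, start_date, end_date, chunk_size=50000, fetch=call_api):
    """Yield pages of rows until the API returns a short or empty page."""
    offset = 0
    while True:
        rows = fetch(build_url(base_url, start_date, end_date, chunk_size, offset))
        if rows:
            yield rows
        if len(rows) < chunk_size:
            break
        offset += chunk_size


def write_chunk(rows, path):
    """Dump one page to a CSV file, using the union of all row keys as the header."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; in normal use the default `call_api` performs the request.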
Note -
To trigger this API call for each year, you have to manually change the dates in preprocessHistoric() and preprocessYearToDate().
- For preprocessHistoric, a sample setup for the API request is:
    def preprocessHistoric(dataURI):
        isValidAndSetDefaults(dataURI, <start_date_as_string>, <end_date_as_string>, "data")  # This is the line that has to be modified
- For preprocessYearToDate, a sample setup for the API request is:
    def preprocessYearToDate(dataURI):
        isValidAndSetDefaults(dataURI, <start_date_as_string>, <end_date_as_string>, "data")  # This is the line that has to be modified
Also, empty directories named exactly data_Historic and data_YTD must exist in the same directory as this notebook for the code to run successfully. Without these directories, the user will not be able to send the API request.
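As an alternative to creating the two directories by hand, they could be created from code before the first request is sent. The helper below is a hypothetical convenience and not part of the notebook; only the directory names come from the note above.

```python
from pathlib import Path


def ensure_output_dirs(base_dir=".", names=("data_Historic", "data_YTD")):
    """Create the output directories if they are missing and return their paths."""
    created = []
    for name in names:
        path = Path(base_dir) / name
        # exist_ok=True makes the call safe to repeat on later runs.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_output_dirs()` from the notebook's directory leaves existing directories and their contents untouched.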
Task 4: Load and represent the data using an appropriate data structure. Apply any pre-processing steps to clean/filter/combine the data
Attributes are trimmed based on their fraction of missing values, computed as sum(nan_counts_in_attribute) / len(attribute).
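The missing-value ratio above can be computed per column with pandas. In this sketch the function name and the 0.5 cut-off are assumptions; the source does not state which threshold the notebook uses.

```python
import numpy as np
import pandas as pd


def trim_sparse_columns(df, max_nan_fraction=0.5):
    """Drop columns whose share of NaN values exceeds max_nan_fraction.

    The default threshold of 0.5 is an assumed value for illustration.
    """
    # Same ratio as in the text: NaN count in the attribute / attribute length.
    nan_fraction = df.isna().sum() / len(df)
    keep = nan_fraction[nan_fraction <= max_nan_fraction].index
    return df[keep]


# Tiny made-up frame: column "b" is 75% NaN, column "c" is 25% NaN.
example = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [np.nan, np.nan, np.nan, 1.0],
        "c": [np.nan, 1.0, 2.0, 3.0],
    }
)
trimmed = trim_sparse_columns(example, max_nan_fraction=0.5)
```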
- fillMissingData(): fills missing values in key columns and performs further trimming of NaN values
  - fillMissingComplaintDate(): helper to extract only the 3 date columns
  - fillNANsInComplaintDate(): fills NaN values in the column Complaint from date from the columns Complaint end date / Reporting date
  - reTrimAndRenameColumns(): trims more columns that are not required and renames columns to more meaningful names
Note - In ideal scenarios, the date on which a crime report is filed is close to the date on which the crime actually happened. Hence, where both complaint_from_date and complaint_end_date are missing, reporting_date is copied into complaint_from_date.
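The fallback order described here (complaint end date first, then reporting date) can be expressed as two chained `fillna` calls. This is a sketch of the idea rather than the notebook's implementation; the function name is hypothetical and the column names are taken from the note above.

```python
import pandas as pd


def fill_complaint_from_date(df):
    """Fill missing complaint_from_date from complaint_end_date, then reporting_date."""
    out = df.copy()  # leave the caller's frame unchanged
    out["complaint_from_date"] = (
        out["complaint_from_date"]
        .fillna(out["complaint_end_date"])  # first fallback
        .fillna(out["reporting_date"])      # second fallback
    )
    return out
```

Rows that already have a complaint_from_date are left as they are; only the gaps are filled.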
- reduceMemory(): most of the columns do not need to be stored as they are; they can be transformed into categories
- extractYearAndMonth(): extracts year and month from the date column
- processDates(): fills NaN values in the column Complaint from date from the columns Complaint end date / Reporting date
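Converting repetitive columns to categories and deriving year and month columns might be sketched as follows. The function names, the `borough` column, and the `year` / `month` output column names are assumptions for illustration; the notebook's own helpers may differ.

```python
import pandas as pd


def reduce_memory(df, columns):
    """Store the given low-cardinality columns with the pandas category dtype."""
    out = df.copy()
    for column in columns:
        out[column] = out[column].astype("category")
    return out


def extract_year_and_month(df, date_column="complaint_from_date"):
    """Add integer year and month columns derived from a date column."""
    out = df.copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising.
    dates = pd.to_datetime(out[date_column], errors="coerce")
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    return out
```

The category dtype tends to pay off when a column holds few distinct values repeated over many rows, which is the usual case for fields such as a borough name.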
Information about the area of each borough is given here
Note - The blue dots represent the number of crimes and the line represents the crime rate.
Inference -
- There is a steady drop in the crime rate from 2013 to 2018.
- The crime rate drops by roughly 7,000 crimes per year.
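One way to arrive at a per-year figure like the one above is to fit a straight line to the yearly counts and read off its slope. The sketch below assumes a least-squares fit with `numpy.polyfit`; the source does not say how the notebook derived its estimate, and the function name is hypothetical.

```python
import numpy as np


def yearly_change(years, counts):
    """Return the slope of a least-squares line through (year, count) points."""
    # Shift the years to start at 0 so the fit is numerically well conditioned.
    x = np.asarray(years, dtype=float) - years[0]
    y = np.asarray(counts, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)
```

A negative return value means the counts fall over time; its magnitude is the average drop in complaints per year.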
Inference -
- The crime rate drops steadily in all the boroughs, with the fastest drop in Brooklyn and the slowest in Staten Island.
- A crime rate based on the number of offences isn't the best indicator for comparing crime across boroughs, as the boroughs differ significantly in area.
Comparing boroughs by number of offences was a problem in the earlier part: the same number of offences spread over a larger area means less crime per unit of area. Let's plot the borough areas.
The above plot clearly shows that the area of Queens is almost 5 times that of Manhattan.
Hence, we use crime density (offences per unit area) instead of the number of offences.
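Crime density is simply the number of offences divided by the borough's area. The sketch below assumes approximate land areas in square miles; these figures are not taken from the notebook and should be checked against the area source it links to. The function name is hypothetical.

```python
# Approximate land areas in square miles (assumed values, to be verified).
BOROUGH_AREA_SQ_MI = {
    "The Bronx": 42.1,
    "Brooklyn": 70.8,
    "Manhattan": 22.8,
    "Queens": 108.5,
    "Staten Island": 58.4,
}


def crime_density(offence_counts, areas=BOROUGH_AREA_SQ_MI):
    """Return offences per square mile for each borough in offence_counts."""
    return {
        borough: count / areas[borough]
        for borough, count in offence_counts.items()
    }
```

With equal offence counts, the smaller borough gets the higher density, which is why Manhattan and Queens can rank very differently under the two measures.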
Work in Progress..