2016-2022-Vancouver-Crime-Analysis

This project uses data analytics and machine learning models as a tool to help the Vancouver Police Department (VPD) predict hourly theft crime across Vancouver's neighbourhoods.

To better understand the 2016-2022 crime data, we researched and brought in relevant data from other sources that could supply the information needed to build the model. Multiple data sources were identified and incorporated into the original crime dataset from VPD. The datasets were prepared for analytics and machine learning by cleaning, transforming, joining and aggregating them in SQL and Python.

The objective of the initial data analysis was to identify key crime trends and patterns that could give direction to further analysis and model-building. The VPD crime data covers several categories of crime, such as Theft, Break In, Attack and Vehicle Collision. In our exploratory data analysis we examined the overall trend of reported crime cases and analyzed time-related and geography-related patterns. Because theft accounts for most of the crime in Vancouver, we narrowed our focus to theft when building the machine learning models. The ultimate goal is to implement predictive policing that helps reduce crime while mitigating risk to law enforcement officers.
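The joining and transforming step described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the field names (`timestamp`, `temperature_c`, `precipitation_mm`), the sample rows, and the helper `enrich` are hypothetical and are not the VPD schema or the project's actual code.

```python
from datetime import datetime

# Hypothetical sample rows; field names are illustrative, not the VPD schema.
crimes = [
    {"type": "Theft", "neighbourhood": "Downtown", "timestamp": "2019-07-01 18:15"},
    {"type": "Break In", "neighbourhood": "Kitsilano", "timestamp": "2019-07-01 02:40"},
    {"type": "Theft", "neighbourhood": "Downtown", "timestamp": "2019-07-02 14:05"},
]

# Hypothetical daily weather table, keyed by ISO date.
weather_by_date = {
    "2019-07-01": {"temperature_c": 21.5, "precipitation_mm": 0.0},
    "2019-07-02": {"temperature_c": 18.0, "precipitation_mm": 4.2},
}


def enrich(crime, weather_lookup):
    """Derive time features and left-join the weather row for the crime's date."""
    ts = datetime.strptime(crime["timestamp"], "%Y-%m-%d %H:%M")
    weather = weather_lookup.get(ts.strftime("%Y-%m-%d"), {})
    return {
        **crime,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "month": ts.month,
        # Missing weather stays None rather than being guessed.
        "temperature_c": weather.get("temperature_c"),
        "precipitation_mm": weather.get("precipitation_mm"),
    }


enriched = [enrich(c, weather_by_date) for c in crimes]
thefts = [row for row in enriched if row["type"] == "Theft"]
```

The same join could equally be written as a SQL `LEFT JOIN` on the date column or as a dataframe merge; the dictionary lookup here only keeps the example self-contained.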

The dataset for the crime prediction model was aggregated by date-time and Vancouver neighbourhood. This allows classification models to generate hourly predictions of whether a particular hour and area is likely to be at high or low risk of theft crime. Descriptive analysis and distribution plots were also used to identify key variables from different categories, such as hour, location and weather. The analysis shows that downtown Vancouver has the highest number of cases for all types of crime. Theft incidents occur most often in the evening (35.10% of all reported theft cases) and in the afternoon (34.02% of all reported theft cases), when more individuals are out, which means more stolen bikes, more theft of or from vehicles, and more people to steal from directly.

The dependent variable is the risk level of theft crime in a given hour and region. Categorical features (neighbourhood, day of week, month, hour) and numeric features (daylight hours, temperature, precipitation, unemployment rate, median household income) were used as inputs to multiple tree-based algorithms run in Python, and XGBoost was determined to be the best of these for the classification task. Because the majority of data points belong to the Low Risk level, synthetic data was generated to overcome the class imbalance. The model's predictive ability was then validated on unseen data and produced satisfactory classification results: it correctly classified 81% of the actual high-risk cases. This indicates that it can maintain its accuracy on new data and help VPD assess whether a region at a given day and time is likely to have more than 3 crimes (HIGH RISK).
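As a minimal sketch of the labelling and evaluation described above, the code below counts thefts per neighbourhood, date and hour, marks a cell HIGH when it has more than 3 incidents, and computes the share of actual high-risk cells that a set of predictions recovers (recall, the quantity behind the 81% figure). The record layout, the toy data and the function names `label_risk` and `recall` are hypothetical; this is not the project's pipeline and it does not train a model.

```python
from collections import Counter

HIGH_RISK_THRESHOLD = 3  # from the text: more than 3 crimes means HIGH RISK


def label_risk(theft_records):
    """Count thefts per (neighbourhood, date, hour) cell and label each cell.

    Cells with no thefts do not appear in the result; a full modelling grid
    would also need those zero-count cells, all of which are LOW.
    """
    counts = Counter(
        (r["neighbourhood"], r["date"], r["hour"]) for r in theft_records
    )
    return {
        cell: "HIGH" if n > HIGH_RISK_THRESHOLD else "LOW"
        for cell, n in counts.items()
    }


def recall(actual, predicted, positive="HIGH"):
    """Fraction of actual positive cases that were predicted positive."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == p == positive)
    false_neg = sum(
        1 for a, p in zip(actual, predicted) if a == positive and p != positive
    )
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Toy data: four thefts in one Downtown hour, one in a Kitsilano hour.
records = [
    {"neighbourhood": "Downtown", "date": "2019-07-01", "hour": 18}
    for _ in range(4)
] + [{"neighbourhood": "Kitsilano", "date": "2019-07-01", "hour": 9}]
labels = label_risk(records)

# Made-up labels and predictions, only to exercise the metric.
actual = ["HIGH", "HIGH", "LOW", "HIGH", "LOW"]
predicted = ["HIGH", "LOW", "LOW", "HIGH", "HIGH"]
high_risk_recall = recall(actual, predicted)
```

Recall on the high-risk class is a natural headline metric here because the classes are imbalanced: a model that always predicts LOW would score high on plain accuracy while never flagging a high-risk hour.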