This project aims to develop a machine learning classifier to predict stop and search outcomes in Scotland. It combines data published by Police Scotland on individual stop and search instances with the Scottish Index of Multiple Deprivation (SIMD), which provides information on the area in which each stop and search took place.
This project was developed with the following datasets:
April - December 2020 stop and search dataset, published on the Police Scotland website
Scottish Index of Multiple Deprivation 2020 v2
The Get Data directory contains a Python script, get_data.py, that downloads the datasets used.
It also scrapes additional data on data zones and electoral wards from statistics.gov.scot, which is needed to aggregate the SIMD data into the larger electoral ward areas.
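As a rough illustration of that aggregation step, the sketch below rolls data-zone values up to ward level using a population-weighted mean. The record layout, the zone and ward names, the numbers, and the choice of a population-weighted mean are all assumptions made for this example; the notebooks in this repository may aggregate differently.

```python
from collections import defaultdict

# Toy, made-up records: (data zone code, ward name, population, SIMD rank).
# Real values would come from the files fetched by get_data.py.
data_zones = [
    ("DZ_A", "Ward 1", 500, 1000),
    ("DZ_B", "Ward 1", 1500, 3000),
    ("DZ_C", "Ward 2", 800, 6000),
]


def aggregate_to_wards(rows):
    """Return the population-weighted mean SIMD rank for each ward."""
    weighted_sum = defaultdict(float)
    population = defaultdict(int)
    for _zone, ward, pop, rank in rows:
        weighted_sum[ward] += pop * rank
        population[ward] += pop
    return {ward: weighted_sum[ward] / population[ward] for ward in population}


ward_simd = aggregate_to_wards(data_zones)
for ward, value in sorted(ward_simd.items()):
    print(ward, value)
```

Weighting by population keeps a small data zone from having the same influence on the ward figure as a large one. The same idea could be written with a pandas groupby once the data is loaded into data frames.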
Two Jupyter notebooks in the Cleaning directory walk through the data cleaning and exploratory data analysis (EDA) process and explain the decisions made.
This directory also contains eda.py, a set of helper functions for the cleaning and EDA process.
The most thoroughly explored classifier is a scikit-learn logistic regression model in logistic_regression.ipynb. This notebook performs forward selection of features from the two datasets, evaluates key metrics for the selected model, optimises the classification threshold and compares performance against the test dataset.
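To make the threshold optimisation step concrete, here is a minimal sketch that scans candidate thresholds over predicted probabilities and keeps the one with the highest F1 score. The labels and probabilities are made-up toy values, the function names are hypothetical, and F1 is an assumed target metric; the notebook may optimise a different one. It is written in plain Python so that it runs on its own.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Toy validation labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold = best_threshold(y_true, y_prob, candidates)
print(threshold, f1_at_threshold(y_true, y_prob, threshold))
```

The threshold would normally be chosen on training or validation predictions, so that the test dataset remains an unbiased check of the final model.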
A short exploration of a decision tree model is contained in decision_tree.ipynb.