This is our final project for the UChicago class CAPP 30254, Machine Learning for Public Policy. For this project, we use machine learning to predict the block groups in Chicago where eviction risk is highest (in the top 10%) in the next 3 years.
Nora Hajjar
Lilian Huang
Peter Li
Kyle Schindl
This is the list of Python libraries needed to run our code (note that `datetime`, `functools`, and `csv` are part of the Python standard library; the rest must be installed):
- numpy
- pandas
- sodapy
- datetime
- census
- functools
- geopandas
- shapely
- statsmodels
- matplotlib
- seaborn
- csv
- scikit-learn
- aequitas
These files are all in the raw_data directory:
- block-groups.csv, the original evictions dataset, manually downloaded from the Eviction Lab
- cb_2017_17_bg_500k, a block-group shapefile, manually downloaded from the Census Bureau
- HOLC_Chicago, a redlining shapefile, manually downloaded from the University of Richmond's Mapping Inequality project
- chicago_blocks.csv, a dataset containing all census blocks in Chicago, manually downloaded from the City of Chicago's Open Data Portal
- Crimes_-_2001_to_present.csv, crime report data, manually downloaded from the City of Chicago's Open Data Portal
We also made use of American Community Survey estimates, but these were accessed through an API (in Notebook_cleaning.ipynb) rather than being downloaded manually.
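For reference, the `census` package used in Notebook_cleaning.ipynb wraps the Census Bureau's REST API; the underlying request looks roughly like the stdlib-only sketch below. The year, variable codes, and column names here are illustrative, not necessarily the exact ones the notebook requests.

```python
from urllib.parse import urlencode

def acs5_blockgroup_url(year, variables, state_fips, county_fips, api_key):
    """Build an ACS 5-year block-group query URL (illustrative sketch)."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": ",".join(variables),  # e.g. B01003_001E = total population
        "for": "block group:*",      # all block groups...
        "in": f"state:{state_fips} county:{county_fips}",  # ...in one county
        "key": api_key,
    }
    return f"{base}?{urlencode(params)}"

# Cook County, Illinois: state FIPS 17, county FIPS 031
url = acs5_blockgroup_url(2017, ["NAME", "B01003_001E"], "17", "031", "YOUR_KEY")
```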
These files are all in the code directory:
- Notebook_cleaning.ipynb, a notebook in which we clean, update, and merge various sources of data
- go.py and final_pipeline_for_vm.py, which run the machine learning pipeline/models
- Evaluation.ipynb and Model Comparisons.ipynb, notebooks used for model evaluation and comparison
- Plots.ipynb, a notebook that creates descriptive plots based on the data
- Aequitas.ipynb, a notebook that uses Aequitas to analyze bias in the models
These files are all in the output_files directory. They are intermediate or output files generated by running our code, but copies are included here for reference as well.
- full_data_chicago.csv is produced by running Notebook_cleaning.ipynb. It is the dataset that is then read into go.py for final processing and for building the machine learning models. It is also used in Plots.ipynb to generate descriptive plots.
- results.csv is produced by running go.py. It contains the full list of machine learning models we trained and tested, and the performance metrics for each of them. It is then read into Evaluation.ipynb and Model Comparisons.ipynb to evaluate and compare our models.
- train.csv and test.csv are produced by running go.py. They are the training and test sets used to fit our chosen best model so that we can generate our final predictions. They are used in Evaluation.ipynb and Aequitas.ipynb to evaluate the performance and bias of our chosen best model.
- block_groups_intervene.csv is our final generated list of predictions, i.e. the block groups where intervention should be applied, as predicted by our chosen best model.
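The top-10% risk label that these predictions target can be built with a simple quantile cutoff. A minimal pandas sketch, where the column names are hypothetical rather than the ones in full_data_chicago.csv:

```python
import pandas as pd

def label_top_decile(df, rate_col="eviction_rate", label_col="high_risk"):
    """Flag rows at or above the 90th percentile of rate_col with 1, else 0."""
    cutoff = df[rate_col].quantile(0.9)
    out = df.copy()
    out[label_col] = (out[rate_col] >= cutoff).astype(int)
    return out

# Toy example: ten block groups; with these values only the highest rate
# (3.1) clears the interpolated 90th-percentile cutoff.
toy = pd.DataFrame(
    {"eviction_rate": [0.5, 1.2, 0.3, 2.0, 0.8, 0.1, 1.5, 0.9, 0.4, 3.1]}
)
labeled = label_top_decile(toy)
```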