ml-chi-evictions

Our final project for CAPP 30254

Predicting eviction risk in Chicago

This is our final project for the UChicago class CAPP 30254, Machine Learning for Public Policy. We use machine learning to predict which block groups in Chicago will have the highest eviction risk (the top 10%) over the next 3 years.
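As a rough illustration of the "top 10%" target, the sketch below flags block groups whose eviction rate is at or above the 90th percentile within a given year. The column names (`GEOID`, `year`, `eviction_rate`) and the function `label_top_decile` are hypothetical, the demo rows are made up, and the 3-year prediction horizon is not modeled here; the project's actual label construction lives in the code directory and may differ.

```python
import pandas as pd


def label_top_decile(df, rate_col="eviction_rate", year_col="year"):
    """Flag rows whose rate is at or above the 90th percentile for their year.

    Column names are assumptions made for this sketch, not the project's schema.
    """
    # Compute the 90th-percentile cutoff separately for each year.
    cutoff = df.groupby(year_col)[rate_col].transform(lambda s: s.quantile(0.9))
    out = df.copy()
    out["high_risk"] = (out[rate_col] >= cutoff).astype(int)
    return out


# Made-up demo data: ten block groups observed in a single year.
demo = pd.DataFrame({
    "GEOID": [f"bg{i}" for i in range(10)],
    "year": [2016] * 10,
    "eviction_rate": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 9.0],
})
labeled = label_top_decile(demo)
print(labeled[labeled["high_risk"] == 1])
```

Computing the cutoff within each year keeps the label relative to that year's distribution, so a citywide rise or fall in evictions does not change how many block groups are flagged.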

Team Members

Nora Hajjar
Lilian Huang
Peter Li
Kyle Schindl

Requirements

These third-party Python libraries must be installed to run our code:

  • numpy
  • pandas
  • sodapy
  • census
  • geopandas
  • shapely
  • statsmodels
  • matplotlib
  • seaborn
  • scikit-learn
  • aequitas

Our code also uses datetime, functools, and csv, which are part of the Python standard library and need no separate installation.
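A quick way to confirm that an environment is ready is to check whether each module can be found before opening the notebooks. This is a minimal sketch, not part of the project's code: `missing_modules` is a hypothetical helper, and the lists use import names, so scikit-learn appears as `sklearn`. The datetime, functools, and csv modules ship with Python and are listed separately.

```python
import importlib.util

# Import names of the third-party packages (scikit-learn imports as sklearn).
THIRD_PARTY = [
    "numpy", "pandas", "sodapy", "census", "geopandas", "shapely",
    "statsmodels", "matplotlib", "seaborn", "sklearn", "aequitas",
]
# These ship with Python itself.
STANDARD_LIBRARY = ["datetime", "functools", "csv"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(THIRD_PARTY + STANDARD_LIBRARY)
print("Missing modules:", missing if missing else "none")
```

Any name printed as missing can then be installed with the package manager of your choice.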

Files Containing Raw Data

These are all in the raw_data directory.

  • block-groups.csv, the original dataset on evictions, downloaded manually from the Eviction Lab
  • cb_2017_17_bg_500k, a block groups shapefile, downloaded manually from the Census Bureau
  • HOLC_Chicago, a redlining shapefile, downloaded manually from the University of Richmond's Mapping Inequality project
  • chicago_blocks.csv, a dataset containing all census blocks in Chicago, downloaded manually from the Open Data Portal of the City of Chicago
  • Crimes_-_2001_to_present.csv, a dataset of reported crimes, downloaded manually from the Open Data Portal of the City of Chicago

We also made use of American Community Survey estimates, but these were accessed through an API (in Notebook_cleaning.ipynb) rather than being downloaded manually.

Scripts and files containing code

These are all in the code directory.

  • This notebook is used to clean, update, and merge the various data sources:
    • Notebook_cleaning.ipynb
  • These files are used to run the machine learning pipeline/models:
    • go.py
    • final_pipeline_for_vm.py
  • These notebooks are used for model evaluation and comparison:
    • Evaluation.ipynb
    • Model Comparisons.ipynb
  • This notebook is used to create descriptive plots based on the data:
    • Plots.ipynb
  • This notebook uses Aequitas to analyze bias in the models:
    • Aequitas.ipynb

Intermediate/output files produced through our processing

These are all in the output_files directory. Running our code generates these intermediate and output files, but copies are included here for reference.

  • full_data_chicago.csv is produced by running Notebook_cleaning.ipynb. It is the dataset which is then read into go.py for final processing and building the machine learning models. It is also used in Plots.ipynb to generate descriptive plots.
  • results.csv is produced by running go.py. It contains the full list of machine learning models we trained and tested, and the performance metrics for each of them. It is then read into Evaluation.ipynb and Model Comparisons.ipynb to evaluate and compare our models.
  • train.csv and test.csv are produced by running go.py. They are the training and test sets used to fit our chosen best model and generate our final predictions. They are used in Evaluation.ipynb and Aequitas.ipynb to evaluate the performance and bias of our chosen best model.
  • block_groups_intervene.csv is our final generated list of predictions, i.e. the block groups where intervention should be applied, as predicted by our chosen best model.
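The README does not spell out how the final list is derived from the model output. One plausible approach, sketched below under stated assumptions, is to rank block groups by predicted risk score and keep the top 10%. The function `select_for_intervention`, the column names `GEOID` and `score`, and the demo scores are all hypothetical and are not taken from the project's files.

```python
import pandas as pd


def select_for_intervention(scores, score_col="score", share=0.10):
    """Return the top `share` of rows ranked by predicted score (at least one row).

    Assumes a higher score means higher predicted eviction risk.
    """
    k = max(1, int(len(scores) * share))
    return scores.nlargest(k, score_col)


# Made-up demo data: twenty block groups with increasing scores.
demo_scores = pd.DataFrame({
    "GEOID": [f"bg{i:02d}" for i in range(20)],
    "score": [i / 20 for i in range(20)],
})
selected = select_for_intervention(demo_scores)
print(selected)
```

Selecting a fixed share rather than applying a score threshold keeps the size of the intervention list predictable, which matters when the resources for intervention are limited.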