/loan-prediction

Predict which loans will be foreclosed on.

Primary LanguagePython

Loan Prediction

Predict whether or not loans acquired by Fannie Mae will go into foreclosure. Fannie Mae acquires loans from other lenders as a way of inducing them to lend more. Fannie Mae releases data on the loans it has acquired and their performance afterwards here.

Installation

Download the data

  • Clone this repo to your computer.
  • Get into the folder using cd loan-prediction.
  • Run mkdir data.
  • Switch into the data directory using cd data.
  • Download the data files from Fannie Mae into the data directory.
    • You can find the data here.
    • You'll need to register with Fannie Mae to download the data.
    • It's recommended to download all the data from 2012 Q1 to present.
  • Extract all of the .zip files you downloaded.
    • On OSX, you can run find ./ -name \*.zip -exec unzip {} \;.
    • At the end, you should have a bunch of text files called Acquisition_YQX.txt, and Performance_YQX.txt, where Y is a year, and X is a number from 1 to 4.
  • Remove all the zip files by running rm *.zip.
  • Switch back into the loan-prediction directory using cd ...

Install the requirements

  • Install the requirements using pip install -r requirements.txt.
    • Make sure you use Python 3.
    • You may want to use a virtual environment for this.

Usage

  • Run mkdir processed to create a directory for our processed datasets.
  • Run python assemble.py to combine the Acquisition and Performance datasets.
    • This will create Acquisition.txt and Performance.txt in the processed folder.
  • Run python annotate.py.
    • This will create training data from Acquisition.txt and Performance.txt.
    • It will add a file called train.csv to the processed folder.
  • Run python predict.py.
    • This will run cross validation across the training set, and print the accuracy score.

Extending this

If you want to extend this work, here are a few places to start:

  • Generate more features in annotate.py.
  • Switch algorithms in predict.py.
  • Add in a way to make predictions on future data.
  • Try seeing if you can predict if a bank should have issued the loan.
    • Remove any columns from train that the bank wouldn't have known at the time of issuing the loan.
      • Some columns are known when Fannie Mae bought the loan, but not before
    • Make predictions.
  • Explore seeing if you can predict columns other than foreclosure_status.
    • Can you predict how much the property will be worth at sale time?
  • Explore the nuances between performance updates.
    • Can you predict how many times the borrower will be late on payments?
    • Can you map out the typical loan lifecycle?