Predicting Loan Defaults and Optimizing Your Investment Strategy

Introduction

Peer-to-peer (P2P) lending is the practice of lending money to individuals or businesses through online services that match lenders with borrowers, enabling anyone to play the role of the bank and fund loans with fixed interest rates. P2P lending is a boom industry: Lending Club is the largest P2P platform today and boasts a total of $26 billion in issued loans since its inception in 2007.

Say you are a large institutional investor that has been trusted with $1M to invest, and you're interested in investing your money on Lending Club. How should you invest your $1M bankroll? With hundreds of new loans possible options available to you daily, it's difficult to manually eyeball your options. Plus, your competitors are using automated models to snatch up the juicy loans before you even have a chance to blink. What you need is a data-driven solution: you need a machine learning model that can automatically pick out the best loans in a matter of seconds to maintain your competitive edge.

In this demo, we will demonstrate how you can use machine learning on the DataScience Platform to optimize your loan selection strategy.

Dataset

Dataset used here contains metadata for over 500K Lending Club loans. These data range from 2007-2017 and are publicly available.

The data includes hundreds of features including:

Interest rate
Debt to Income ratio (DTI)
Term (Loan Length)
Purpose of Loan
And many more

With the following binary response variable that determines whether a loan ultimately was paid-off or defaulted:

Default

Goals

The primary goal of this demo is to use machine learning to automatically select Lending Club loans that provide the best return on investment.

This goal can be broken into several subgoals:

We will use notebook containers to utilize the popular Python machine learning package scikit-learn to develop a random forest model for predicting the probability of default for a given loan.
We will synthesize any insights or methodology into reports to keep our stakeholders informed about our efforts.
We will deploy an API for our random forest model so that we productionize it to power any internal application we choose.
We will use the DataScience platform's scheduled runs feature to productionized our model to make automatic data-driven decision on daily batches of new loans, selecting out the best loan for our portfolio in seconds.

Key Benefits of the Platform For This Demo

The features of the DataScience Platform can help us achieve these goals with the following:

Notebook containers allow you to launch sessions with scalable resource allocations. This ensures that you have all the resources you need on-demand to implement sophisiticated machine learning models. This also ensures that your session does not overwhelm the resources available to the rest of your team. Finally, GitHub integrations allow you to sync the work you do within a session by simply clicking a button.
Reports allow you to effortlessly convert any notebook or Markdown formatted code into a polished report in one centralized location. No need to search through your inbox for that email with the screenshots from your analysis. Simply share the link to the report with your colleagues and you're done!
Deployed API allows you to productionize any model you create. This API can be shared with a colleague for his or her own project, used to power any internal application, or implemented in an automated scheduled run.
Scheduled runs allow you to automate a process hourly, daily, or at a time interval of your choice. This feature has synergy with the deployed API: you can schedule a model to score daily batches of new loans at the same time every day.

Structure of this Demo

This demo will include the following steps: