In this project we predict whether a customer will default on a loan. The main use of classification models is to score the likelihood of an event occurring. For loan data, the model is used to predict whether a loan will be paid off in full or will need to be charged off and possibly go into default. You can use the model to score the quality of current loans and identify the ones most likely to default.😀🏹 🥇💯
You will need Anaconda to run the code, so install it first and then go through the code below. To download Anaconda, click here.
After installing Anaconda, open Anaconda Navigator, install Jupyter Notebook inside the Anaconda environment, and then run the .ipynb code in Jupyter Notebook.
The goal is to classify whether a borrower will default on a loan using the borrower’s financial history. That is, given a set of new predictor variables, we need to predict the target variable as 1 -> Defaulter, 0 -> Non-Defaulter. The MIS_Status recorded by a bank is Default when a borrower is unable to make payments on time, delays payments, or refuses or avoids payment. Individuals, organizations and even states can fall into default. To avoid such losses, we built a model that predicts whether a borrower will default or pay in full.
- Data Preparation – Many of the 26 features in our dataset had empty cells, which we imputed. Features that did not seem relevant to our goal were removed.
- Data Modelling – The data is divided into two parts, training data and test data. The model is developed on the training data and evaluated on the test data.
- Model Evaluation – After development, the model is evaluated on accuracy, precision, recall and F1 score.
- Model Deployment – The model is deployed using Flask.
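The modelling and evaluation steps in this workflow can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the data is synthetic (generated with `make_classification`), and the 80/20 split, the choice of `RandomForestClassifier` and all parameter values are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared loan data (1 = defaulter, 0 = non-defaulter).
X, y = make_classification(
    n_samples=1000, n_features=9, weights=[0.74, 0.26], random_state=42
)

# Hold out 20% of the rows for testing (assumed ratio); stratify keeps the class mix similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Develop the model on the training data only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test data with the four metrics named above.
y_pred = model.predict(X_test)
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With the real dataset, `X` and `y` would be the prepared feature columns and the encoded MIS_Status column instead of the synthetic arrays.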
- Total number of rows – 149999
- Total number of columns – 26
- Date range (years) – 1962 to 2007
- Missing values – 110938
- Target variable – MIS_Status

Then we find the unique values of each variable (column).
- Charge OFF
- Paid in Full
- New Exist
- LowDoc
- Franchise Code
- Urban Rural
- RevLineCr
- MIS_Status
- Zip
- CCSC
- ApprovalFy
- Term
- NoEmp
- Create Job
- Retained Job
- Disbursement Gross
- Balance Gross
- Chg Off Prin Gr
- GrAppv
- SBA_Appv
- Name
- City
- State
- Bank
- BankState
- ApprovalDate
- ChgoffDate
- DisbursementDate
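Summary figures such as the row count, column count, number of missing values and unique values per column could be gathered with pandas along these lines. The small frame below is made up for illustration; its column names are borrowed from the dataset, but its values are not real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the loan dataset.
df = pd.DataFrame({
    "State": ["CA", "NY", None, "CA"],
    "Term": [84, 120, 84, np.nan],
    "MIS_Status": ["PIF", "CHGOFF", "PIF", "PIF"],
})

n_rows, n_cols = df.shape                 # total rows and columns
n_missing = int(df.isna().sum().sum())    # missing cells across the whole frame
unique_counts = df.nunique()              # distinct non-null values per column

print(f"rows={n_rows}, columns={n_cols}, missing={n_missing}")
print(unique_counts)
```

`df[column].unique()` lists the distinct values themselves when the counts alone are not enough.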
The dataset had many empty and irrelevant features, which have been removed.
● String values have been converted to integers.
● Categorical values have been transformed to numerical values.
● Redundant variables have been dropped.
● NaN values have been filled with the mode of the corresponding column.
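Filling missing cells with each column's mode might look like the sketch below. The frame and its values are invented for the example; only the technique (mode imputation per column) comes from the description above.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing cell per column.
df = pd.DataFrame({
    "LowDoc": ["N", "N", None, "Y"],
    "NewExist": [1.0, np.nan, 1.0, 2.0],
})

# Most frequent value of each column (the first one if there is a tie).
modes = {col: df[col].mode().iloc[0] for col in df.columns}

# Replace every NaN with the mode of its own column.
df_filled = df.fillna(value=modes)
print(df_filled)
```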
- Before Dealing with the Missing values
- After Dealing with the Missing values
The variables “DisbursementGross”, “BalanceGross”, “ChgOffPrinGr”, “GrAppv” and “SBA_Appv” are converted to numeric.
● MIS_Status: PIF = 0 (110831), ChgOff = 1 (38926)
● Franchise Code: No Franchise = 0 (144504), Franchise = 1 (5253)
● NewExist: 1 = Existing Business (101773), 2 = New Business (47984)
● LowDoc: No = 0 (137716), Yes = 1 (12041)
● RevLineCr: No = 0 (95157), Yes = 1 (54600)
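A sketch of these conversions is shown below. It assumes the amount columns are stored as currency-formatted strings such as "$60,000.00" and that the raw category labels are "N"/"Y" and "PIF"/"CHGOFF"; the actual raw formats in the dataset may differ, so treat the patterns and mappings as placeholders.

```python
import pandas as pd

# Made-up rows; the string formats are assumptions, not taken from the dataset.
df = pd.DataFrame({
    "DisbursementGross": ["$60,000.00 ", "$1,250.50 "],
    "LowDoc": ["N", "Y"],
    "MIS_Status": ["PIF", "CHGOFF"],
})

# Strip "$", "," and whitespace, then convert the amount to a float.
df["DisbursementGross"] = (
    df["DisbursementGross"].str.replace(r"[$,\s]", "", regex=True).astype(float)
)

# Encode the binary categories as 0/1, matching the encoding listed above.
df["LowDoc"] = df["LowDoc"].map({"N": 0, "Y": 1})
df["MIS_Status"] = df["MIS_Status"].map({"PIF": 0, "CHGOFF": 1})
print(df)
```

`Series.map` turns any label missing from the mapping into NaN, which makes unexpected raw values easy to spot afterwards.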
- City – LOS ANGELES, NEW YORK and MIAMI are the top three cities.
- State – CA, NY and TX are the top three states.
- Bank – BANK OF AMERICA NATL ASSOC, CITIZENS BANK NATL ASSOC and CAPITAL ONE NATL ASSOC are the top three banks.
- Bank State – NC, RI and IL are the top three bank states.
- CCSC – There are 1184 unique codes in total.
- ApprovalDate – The most common approval date for Small Business Administration commitments is 1997-09-30.
- Term – The most common loan term is 84.
- NoEmp – Number of business employees (sum): 30787
- NewExist – Existing Business: 101648, New Business: 47984
- Create Job & Retained Job – Number of jobs created: 37130
- LowDoc – LowDoc Loan Program: Y = 12041, N = 137632
- Urban Rural – Urban: 81724, Rural: 51370
- RevLineCr – Revolving line of credit: Yes = 71503, No = 49780
- MIS_Status (target variable) – Loan status: CHGOFF (high risk) = 38926, PIF (lower risk) = 110831
- Disbursement Date – The most common disbursement date is 2006-05-31.
- GrAppv – Minimum 200, maximum 4000000
- SBA_Appv – Minimum 100, maximum 4000000
● Bank with highest gross approval – The top three banks by gross approval amount are SMALL BUS. GROWTH CORP, FIRSTMERIT BANK, N.A and GE CAP. SMALL BUS FINAN CORP.
● Bank state with highest gross approval – The top three bank states by gross approval are Illinois, Ohio and Texas.
● Urban/Rural with gross approval – Borrowers in urban areas have higher gross approval than the others.
● Urban/Rural with gross approval and MIS_Status – Rural borrowers with a gross approval below 100000 fall under PIF more often than urban borrowers.
● NewExist with term and MIS_Status – From the above graph we can see that existing businesses with a term of 120 fall under PIF.
● GrAppv, LowDoc with MIS_Status – Borrowers not in the LowDoc Loan Program are mostly under PIF.
● RevLineCr with MIS_Status – Borrowers without a revolving line of credit fall under PIF more often than those with one.
● NewExist and MIS_Status – Existing businesses fall under PIF more often.
Using ExtraTreesClassifier, feature importance is determined and unnecessary variables are removed again. The following columns show high importance: “Zip”, “CCSC”, “ApprovalFY”, “Term”, “UrbanRural”, “LowDoc”, “DisbursementGross”, “GrAppv”, “SBA_Appv”. We drop ChgOffPrinGr because it depends on the target variable.
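Ranking features with `ExtraTreesClassifier` could be done roughly as in this sketch. The data is synthetic and the feature names (`feature_0` … `feature_7`), the number of trees and the cut-off of four features are all placeholders chosen for the example, not values from the project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with hypothetical feature names.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)

# Fit the tree ensemble; feature_importances_ sums to 1 across all features.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank the features from most to least important.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# Keep only the highest-ranked features (the cut-off of 4 is arbitrary here).
top_features = importances.head(4).index.tolist()
print(importances)
print(top_features)
```

A column derived from the outcome itself, as ChgOffPrinGr is, would score very high in such a ranking, which is why it has to be dropped by hand rather than selected.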
SGDClassifier
Support Vector Machine (SVM)
KNN (K-Nearest Neighbour)
Random Forest
Random Forest with RandomizedSearchCV
Random Forest with SMOTE
XGBClassifier
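Several of the models listed above can be trained and compared in one loop, as in the sketch below. It uses synthetic data and arbitrary hyperparameters, and it leaves out the SMOTE and XGBClassifier variants because they need the extra `imbalanced-learn` and `xgboost` packages; the search grid for `RandomizedSearchCV` is likewise only an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the prepared loan features and target.
X, y = make_classification(n_samples=600, n_features=9, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

models = {
    "SGDClassifier": SGDClassifier(random_state=1),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    # Randomized search over a small, assumed hyperparameter grid.
    "Random Forest + RandomizedSearchCV": RandomizedSearchCV(
        RandomForestClassifier(random_state=1),
        param_distributions={
            "n_estimators": [25, 50, 100],
            "max_depth": [3, 5, None],
        },
        n_iter=4,
        cv=3,
        random_state=1,
    ),
}

# Fit each model on the training data and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.3f}")
```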
For our loan default prediction project, the false negative rate is the best metric for evaluating the model: the fewer false negatives, the better the model. Here, a false negative occurs when the model predicts that a borrower will not default on a loan even though they will. Our model cannot afford a high number of false negatives, as they hurt investors and the credibility of the company. We therefore evaluated our models using the number of false negatives and accuracy.
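Counting false negatives and deriving the false negative rate from a confusion matrix can be sketched as follows. The labels are a made-up toy example (1 = defaulter, 0 = non-defaulter), not project results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 4 actual defaulters and 6 actual non-defaulters.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Share of actual defaulters that the model predicted as non-defaulters.
false_negative_rate = fn / (fn + tp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"false negatives: {fn}")
print(f"false negative rate: {false_negative_rate:.2f}")
print(f"accuracy: {accuracy:.2f}")
```

The false negative rate equals 1 minus recall for the defaulter class, so minimising one is the same as maximising the other.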
We deployed the model using Flask. Flask is a micro web framework written in Python. It can create a REST API that lets you send data and receive a prediction as a response.
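A minimal Flask prediction endpoint might look like the sketch below. It is not the project's app1.py: the `/predict` route, the JSON payload shape and the `predict_default` helper are hypothetical, and the helper uses a stand-in rule rather than a trained model so the example stays self-contained.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_default(features):
    """Hypothetical placeholder for a trained model's predict call.

    Returns 1 (defaulter) or 0 (non-defaulter) from a made-up rule on "Term".
    """
    return 1 if features.get("Term", 0) < 60 else 0


@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON body sent by the client and return the prediction as JSON.
    features = request.get_json(force=True)
    return jsonify({"prediction": predict_default(features)})


# Exercise the endpoint in-process with Flask's test client (no server needed).
with app.test_client() as client:
    response = client.post("/predict", json={"Term": 36})
    print(response.get_json())
```

Calling `app.run()` at the end of the script would start the local development server instead, and in a real deployment `predict_default` would load the saved model and call its `predict` method.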
- Open Command Prompt (cmd).
- In cmd, open the project folder, then create an env folder and install all the libraries with the command: conda create --prefix ./env pandas numpy matplotlib scikit-learn jupyter flask
- Activate Anaconda in cmd with the command: conda activate
- After activating conda, give the path of the env using: conda activate (path of the env)
- With the env folder opened, change to the main folder using: cd (main folder path)
- Run the app1.py file in cmd with the command: python app1.py
- After the command runs and all the libraries are loaded, a localhost address is shown in cmd. Copy that address.
- Paste the localhost address into a web browser and open it.