Applying CRISP-DM methodology for predicting Loan Elegibility
The company Dream Housing Finance (DHF) deals in all home loans. They have a presence across all urban, semi-urban and rural areas. Customers first apply for a home loan after that company validates the customer's eligibility for a loan. The company wants to automate the loan eligibility process (real-time) based on customer detail provided while filling out an online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, they have given a problem to identify the customer segments that are eligible for loan amounts to specifically target these customers that seem safe to borrow money from DHF. DHF owns data from its past operation that is regarded as one of its most valuable business assets. We can find the data in the loan.csv dataset.
The following report presents the CRISP-DM phases developed in a Jupyter Notebook (Python 3) using the libraries: NumPy, pandas, matplotlib, seaborn, stats model, sklearn, among others. We stored the loan.csv dataset in a data frame which we thoroughly manipulated to obtain the correct predictions of the prediction.csv.
The dataset initially contained information on 618 customers who have asked for a loan to DHF. The information in these clients appears across 12 different independent features, both categorial and numerical.
Dataset Key Information
Key Name | Description |
---|---|
Loan_ID | Unique Loan ID |
Gender | Male/ Female |
Married | Applicant married (Y/N) |
Dependents | Number of dependents |
Education | Applicant Education (Graduate/ Under Graduate) |
Self_Employed | Self-employed (Y/N) |
ApplicantIncome | Applicant income |
CoapplicantIncome | Coapplicant income |
LoanAmount | Loan amount in thousands |
Loan_Amount_Term | Term of a loan in months |
Credit_History | credit history meets guidelines |
Property_Area | Urban/ Semi-Urban/ Rural |
Loan_Status | Loan approved (Y/N) |
First of all, we deleted the variable Loan_ID
since it provided no meaningful information. Next, we changed the data type of Credit_History
from numerical to categorical since we found out that its values only lay among 0 and 1. Where 0 meant that they did not have a credit history; otherwise, its value was one. Then, we divided the data frame into two datasets, one that contained the information of categorical variables and the other with the numerical ones.
From here, we analyzed the percentage of missing data in each dataset. Five of the categorical variables had missing data (Gender
, Married
, Dependents
, Self_Employed
, and Credit_History
). To correct this, we used the SimpleImputer
functions from the sklearn library, where we imputed the missing values with the most frequent values from each column.
Then, we analyzed the numerical variables, two of which contained missing data. We use the KNNImputer
that fills missing values through the implementation of the k-Nearest Neighbor approach to treat this issue.
Additionally, we analyzed the data balance through the target variable Loan_Status, where we noticed that 422 customers were granted a loan and only 192 did not. The unbalanced target variable shows that the proportion of the category “Y” is 2.2 concerning category “N.” Thus, we proceeded into developing a random oversampling in the type “N,” which results in 422 customers for each category.
Furthermore, we transformed the categorical variables to numerical by implementing label and one-hot Encoding (to those with more than two unique variables). After this transformation, the categorical data lied between 0 and 1; thus, we needed to apply a min-max scaling transformation to numerical data. With this transformation, the magnitude of the numerical values is equal to the categorical.
Next, we performed the Exploratory Data Analysis (EDA) for the categorical data. For this analysis, we developed a series of bar plots that convey the frequency of the unique values of each variable. Additionally, we also developed a plot with the proportions regarding if the loans had been accepted or not. From these visualizations, we concluded that men ask more for loans than women; however, both genders have the same probability of being granted a loan. Therefore, gender is not a determining factor in predicting if a loan should be granted or not.
Furthermore, more married clients ask for loans, and they have a higher probability of being accepted. Customers who have no dependents have a significantly higher likelihood of asking for a loan. Still, curiously, if the client has two dependents, they have a higher probability of being granted the loan. Moreover, if the applicant is not graduated from college, they have a lower likelihood of being accepted for a loan. The customer's credit history is the most significant variable since it shows that the customers who have a credit history appear to have an almost 80% probability of being granted a loan. However, if the customer does not have a credit history, they have only a 7.9% chance of getting a loan. Finally, suppose the property is in a suburban area. In that case, the client has a higher probability of being granted a loan than if the property was in an urban or rural area.
Furthermore, for the numerical data, we first visualized the distribution of each variable through a histogram, where we see that most of them are skewed to the right, except for the Loan_Amount_Term
variable. Additionally, we developed a series of boxplots that describe the distribution of the data of each numerical variable and its loan status. From these plots, we can see that all of them convey a very similar distribution between the data of the accepted and rejected loans. Thus, they appear to be non-meaningful for predicting the target variable. For more in-depth information, we developed a series of ANOVA F-tests to prove that the mean of each numerical variable for accepted loans is the same as that for unaccepted. The results of these tests prove that it seems that the numerical data does not influence the decision of whether the loan is accepted or not.
Variable | p-value | Null Hypothesis |
---|---|---|
ApplicantIncome | 0.390 | Not rejected |
CoapplicantIncome | 0.072 | Not rejected |
LoanAmount | 0.66 | Not rejected |
Loan_Amount_Term | 0.328 | Not rejected |
To finalize EDA, we developed a correlation analysis heatmap, where we convey the linear relationship in a correlation matrix among all variables. This heatmap shows that the categorical variables resulting from the one-hot encoding (Dependents_0
, Dependents_1
, Dependents_2
, Dependents_3
, Property_Area_0
, Property_Area_1
, Property_Area_2
) have a pronounced negative linear correlation among them. However, we can see that married status has a 37% correlation with gender. If the person is male, they have a higher probability of also being married. Additionally, there is a 25% correlation between being married and having two dependents, which makes sense since most of the time, married couples who have a growing family, tend to ask for loans to buy a bigger house. Furthermore, the amount of money being loaned has a 57% correlation with the applicant's income. This conclusion makes sense since the more an applicant gains, the more they would be able to pay; thus, the more they are loaned.
Finally, we analyzed the variables that appeared to correlate significantly with our target variable, Loan_Status
. We see that people who are asking for loans in the urban area have an 18% probability of being granted the loan. Additionally, if the applicant is married, they have a 13% probability of having an accepted loan. Also, we can conclude that having a credit history has the most influence on whether the loan is accepted or rejected, with a 51% correlation.
Furthermore, we performed Kendall’s Rank Correlation hypothesis test to determine whether the correlation between the categorical target variable Loan_Status
and the independent numerical variables is significant. For these tests, the null hypothesis would be that both variables are independent. Therefore, this analysis concludes that the numerical variables seem to be independent of the target variable.
Variable | correlation | p-value | Null Hypothesis |
---|---|---|---|
ApplicantIncome | 0.004 | 0.890 | Not rejected |
CoapplicantIncome | 0.057 | 0.059 | Not rejected |
LoanAmount | -0.007 | 0.796 | Not rejected |
Loan_Amount_Term | -0.020 | 0.549 | Not rejected |
Moreover, we developed the Married
, Education
, Credit_History
, Dependents_2
, Property_Area_1
, and Property_Area_2
seem to have a significant correlation with the target variable.
Variable | stat | p-value | Null Hypothesis |
---|---|---|---|
Gender | 1.279 | 0.258 | Not reject |
Married | 13.594 | 0.000 | Reject |
Education | 8.027 | 0.005 | Reject |
Self_Employed | 0.476 | 0.490 | Not reject |
Credit_History | 214.454 | 0.000 | Reject |
Dependendents_0 | 0.000 | 1.000 | Not reject |
Dependendents_1 | 1.013 | 0.314 | Not reject |
Dependendents_2 | 3.242 | 0.072 | Reject |
Dependendents_3 | 0.726 | 0.394 | Not reject |
Property_Area_0 | 2.102 | 0.147 | Not reject |
Property_Area_1 | 25.870 | 0.000 | Reject |
Property_Area_2 | 12.630 | 0.000 | Reject |
We developed an RFE feature selection strategy to finalize the data preparation phase. Through this technique we found the most significant variables that predict the Loan_Status
. We first got the optimum number of features according to a Logistic Regression model. This analysis concluded that we required only six variables to achieve the highest accuracy possible. Next, we applied for this number again to the logistic regression. As a result, we obtained that Married
, Credit_History
, Property_Area_1
, ApplicantIncome
, CoapplicantIncome
, and LoanAmount
were the most valuable variables to predict Loan_Status
.
Now that we have this information, we developed a VIF analysis of the significant variables, from which we obtained the results below.
Variable | stat |
---|---|
Married 2.267208 | |
Credit_History | 2.796159 |
Property_Area_1 | 1.462494 |
ApplicantIncome | 3.139574 |
CoapplicantIncome | 1.419624 |
LoanAmount | 5.755920 |
This analysis tells us that the essential variables do not contain multicollinearity problems, so all of these may be part of the final data to feed the model.
For this phase, we applied decision trees, logistic regression, support vector machines, Naïve Bayes, and ensemble methods bagging, boosting, and XGB models to predict the target variable Loan_Status. We first divided the data into the dependent and independent variables, called y and X, respectively, to evaluate these models. Then, we divided them into training and test set, with an 80%-20% ratio.
For the evaluation of the algorithms, we developed a K Fold Cross-Validation such as in the figure below. This technique consists of the training set partitioning into equal-sized K parts. One of these partitions is used as a validation set and the rest as a sub-training set. For this strategy, we defined a total of 10 partitions.
Now we show the mean of the metrics for each algorithm in terms of ROC/AUC, accuracy, precision, recall, and f1 score.
Algorithm | ROC AUC Mean | Accuracy Mean | Precision Mean | Recall Mean | F1 Score Mean |
---|---|---|---|---|---|
Random Forest | 92.65 | 85.37 | 87.00 | 83.60 | 84.98 |
XGB | 90.82 | 84.07 | 85.57 | 81.99 | 83.51 |
GBM | 87.69 | 80.93 | 77.17 | 86.98 | 81.66 |
Decision Tree Classifier | 84.50 | 84.44 | 86.52 | 81.36 | 83.73 |
Logistic Regression | 81.00 | 72.41 | 66.97 | 88.72 | 75.91 |
SVM | 79.10 | 71.11 | 66.13 | 87.99 | 74.66 |
Naïve Bayes | 78.81 | 73.33 | 65.91 | 96.97 | 78.31 |
Next, we convey the boxplots that contain the distribution of the accuracy score, and ROC/AUC score for each algorithm. From these plots, we can conclude that some of the best models for fitting these data are GBM, XGB, Decision Tree, and Random Forest. These models achieve significantly higher accuracy and ROC/AUC scores than others.
For the next iteration, we developed some hyperparameter tuning for the algorithms that attained the best accuracy and AUC scores (GBM, XGB, Decision Tree, and Random Forest) so that they may further improve their metrics. The hyperparameter tuning was developed through the use of GridSearchCV, this function allowed us to try different values for several parameter. Where the results were the following:
- The best parameters for Random Forest are {'criterion': 'gini', 'max_depth': 13, 'min_samples_leaf': 1, 'n_estimators': 45}
- The best parameters for GBM are {'learning_rate': 0.1, 'n_estimators': 250, 'max_depth': 4}
- The best parameters for Decision Tress are {'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 5}
- The best parameters for XGB are {'gamma': 0.1, 'max_depth': 15, 'min_child_weight': 1}
The resulting metrics of these changes appear below:
Model | ROC/AUC | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
Random Forest | 0.928445 | 0.866667 | 0.878378 | 0.878378 | 0.878378 |
Gradient Boosting | 0.919805 | 0.844444 | 0.884058 | 0.824324 | 0.853147 |
XGB | 0.888569 | 0.829630 | 0.892308 | 0.783784 | 0.834532 |
Decision Tree | 0.849801 | 0.785185 | 0.816901 | 0.783784 | 0.800000 |
Support Vector Machine | 0.763181 | 0.718519 | 0.695652 | 0.864865 | 0.771084 |
Logistic Regression | 0.753434 | 0.711111 | 0.701149 | 0.824324 | 0.757764 |
Naive Bayes | 0.745680 | 0.718519 | 0.680000 | 0.918919 | 0.781609 |
This table shows that the Random Forest became the optimal model to make predictions for the given dataset, as it has relatively the highest combination of AUC, accuracy, precision, recall, and F1 scores. Hence, we used this model to evaluate its performance when training in the entire train dataset and testing in the true testing dataset.
Model | ROC/AUC | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
Random Forest | 0.920244 | 0.87574 | 0.857143 | 0.888889 | 0.872727 |
We applied the best model obtained to make predictions using the attached predict.csv as the test set and loan.csv as training set. We created a new Jupyter Notebook (CRISP-DM (Deployment Phase).ipynb) that cleans the predict.csv dataset by imputing missing values, labeling and applying one-hot encoding on categorical data, transforming continuous values into integers, and selecting the essential features. The predictions for each user appear in Model Predictions.xlsx.
In conclusion, we used the Dream Housing Finance Loan dataset to build a machine learning classifier to automate the loan eligibility process. This model attained a reasonable accuracy score of 87.5%. Additionally, according to this analysis, we can conclude that the customer segments that DHF should target are applicants that appear to be married and are looking for a property in the suburban area. This situation could mean that they may be planning to grow a family; thus, they have a higher probability of being responsible for avoiding debts. Furthermore, these applicants and their co-applicants, should count on a high amount of income. If DHF targets people who follow these characteristics, they can ensure that customers will be capable of paying back; hence, DHF will be more secure in lending a higher amount of money to them. Finally, and most importantly, ensure that the person has a credit history because applicants who have repaid their previous debts have a significantly higher probability of repaying this one.