- Overview
- Dataset
- Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Model Development
- Results
- Installation
- Usage
- Contributing
- License
This project implements a machine learning pipeline for predicting loan approvals. The pipeline covers data preprocessing, exploratory data analysis, feature engineering, and model evaluation. The best model so far, Logistic Regression, reaches a mean validation accuracy of 0.7881.
The project uses two datasets:
- loan_train.csv: Training dataset with 491 entries and 13 features
- loan_test.csv: Test dataset with 123 entries and 12 features (excluding the target variable)
Key features include:
- Loan_ID
- Gender
- Married
- Dependents
- Education
- Self_Employed
- ApplicantIncome
- CoapplicantIncome
- LoanAmount
- Loan_Amount_Term
- Credit_History
- Property_Area
- Loan_Status (target variable)
- Handling missing values:
  - Mode imputation for categorical variables (Gender, Married, Dependents, Self_Employed, Credit_History)
  - Median imputation for numerical variables (LoanAmount)
  - Mode imputation for Loan_Amount_Term
- Outlier treatment:
  - Log transformation applied to LoanAmount to reduce right skew
- Encoding categorical variables:
  - One-hot encoding used for all categorical variables
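The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the `preprocess` helper and the `LoanAmount_log` column name are assumptions, and only the string-valued columns in the sample are one-hot encoded.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for loan_train.csv (values are illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "Yes", None, "No"],
    "Dependents": ["0", "1", "0", None],
    "Self_Employed": ["No", None, "No", "Yes"],
    "Credit_History": [1.0, None, 1.0, 0.0],
    "LoanAmount": [128.0, None, 66.0, 120.0],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0],
})


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, log-transform, and one-hot encode."""
    out = frame.copy()

    # Mode imputation for the categorical columns and Loan_Amount_Term.
    mode_cols = ["Gender", "Married", "Dependents", "Self_Employed",
                 "Credit_History", "Loan_Amount_Term"]
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])

    # Median imputation for LoanAmount.
    out["LoanAmount"] = out["LoanAmount"].fillna(out["LoanAmount"].median())

    # Log transformation to reduce right skew (column name is an assumption).
    out["LoanAmount_log"] = np.log(out["LoanAmount"])

    # One-hot encoding of the string-valued categorical columns.
    cat_cols = ["Gender", "Married", "Dependents", "Self_Employed"]
    return pd.get_dummies(out, columns=cat_cols)


clean = preprocess(df)
```

Fitting the imputation values on the training set and reusing them on the test set would avoid leaking test statistics; the sketch skips that for brevity.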
- Univariate analysis:
  - Distribution of loan approval status
  - Distribution of categorical variables (Gender, Married, Dependents, Education, Self_Employed, Credit_History, Property_Area)
  - Distribution of numerical variables (ApplicantIncome, CoapplicantIncome, LoanAmount)
- Bivariate analysis:
  - Relationship between categorical variables and loan approval status
  - Relationship between numerical variables and loan approval status
- Correlation analysis:
  - Heatmap of correlation between numerical variables
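As a rough illustration of the analyses listed above, the following sketch computes the underlying tables with pandas alone. The sample rows are invented for the example and the variable names are assumptions; the notebook's own plots may be produced differently.

```python
import pandas as pd

# Made-up sample rows; the real analysis would read loan_train.csv.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0, 4196.0],
    "LoanAmount": [128.0, 128.0, 66.0, 120.0, 141.0, 267.0],
})

# Univariate: share of approved vs. rejected loans.
status_share = df["Loan_Status"].value_counts(normalize=True)

# Bivariate: approval rate within each Credit_History group (rows sum to 1).
approval_by_credit = pd.crosstab(
    df["Credit_History"], df["Loan_Status"], normalize="index"
)

# Correlation matrix of the numerical variables (the input to a heatmap).
numeric_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]
corr = df[numeric_cols].corr()
```

The `corr` frame can then be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to draw the heatmap.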
Based on domain knowledge and insights from EDA, we created the following new features:
- Total_Income: Combination of ApplicantIncome and CoapplicantIncome
- Total_Income_log: Log transformation of Total_Income
- EMI: LoanAmount divided by Loan_Amount_Term
- Balance Income: Total_Income minus (EMI * 1000)
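A minimal sketch of how these features could be derived with pandas is shown below. It assumes that "combination" means the sum of the two incomes and that the factor of 1000 is there because LoanAmount is recorded in thousands; the `add_features` helper and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [1000.0, 0.0],
    "LoanAmount": [120.0, 90.0],
    "Loan_Amount_Term": [360.0, 180.0],
})


def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that adds the four engineered features."""
    out = frame.copy()
    # Assumption: Total_Income is the sum of both incomes.
    out["Total_Income"] = out["ApplicantIncome"] + out["CoapplicantIncome"]
    out["Total_Income_log"] = np.log(out["Total_Income"])
    # EMI: loan amount spread evenly over the loan term.
    out["EMI"] = out["LoanAmount"] / out["Loan_Amount_Term"]
    # Assumption: EMI is scaled by 1000 to match the units of income.
    out["Balance Income"] = out["Total_Income"] - out["EMI"] * 1000
    return out


featured = add_features(df)
```

For the second row, EMI is 90 / 180 = 0.5, so Balance Income is 3000 - 500 = 2500.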
Four models were planned, of which two have been implemented and evaluated so far:
- Logistic Regression
- Decision Tree
- Random Forest (Pending)
- XGBoost (Pending)
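Each implemented model was evaluated using 5-fold stratified cross-validation, which keeps the class balance of Loan_Status roughly the same in every fold and gives a more robust performance estimate than a single train/validation split.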
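The evaluation scheme can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic (generated with `make_classification`), and the hyperparameters, shuffling, and random seeds are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the prepared loan features.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.3, 0.7], random_state=0
)

# 5-fold stratified splitter: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

metrics = ["accuracy", "f1", "roc_auc"]
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    # Average each metric over the five validation folds.
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in metrics}

for name, summary in results.items():
    print(name, summary)
```

Reusing the same `cv` object for every model means all models are scored on identical folds, which keeps the comparison fair.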
Here's a summary of the model performances:
Model | Mean Validation Accuracy | Mean Validation F1 Score | AUC |
---|---|---|---|
Logistic Regression | 0.7881 | 0.8622 | 0.8176 |
Decision Tree | 0.7168 | 0.8027 | 0.7500 |
Random Forest | Not implemented | Not implemented | N/A |
XGBoost | Not implemented | Not implemented | N/A |
Note: The Random Forest and XGBoost models were mentioned in the project plan but not implemented in the provided notebook.
git clone https://github.com/riziuzi/loan-prediction.git
cd loan-prediction
pip install -r requirements.txt
Contributions to this project are welcome. Please fork the repository and submit a pull request with your proposed changes.
Created by riziuzi