Loan Prediction Project

Overview
Dataset
Data Preprocessing
Exploratory Data Analysis
Feature Engineering
Model Development
Results
Installation
Usage
Contributing
License

Overview

This project implements a machine learning pipeline for predicting loan approvals. It covers data preprocessing, exploratory data analysis, feature engineering, and model evaluation. The best model so far, Logistic Regression, reaches a mean validation accuracy of 0.7881.

Dataset

The project uses two datasets:

loan_train.csv: Training dataset with 491 entries and 13 features
loan_test.csv: Test dataset with 123 entries and 12 features (excluding the target variable)

Key features include:

Loan_ID
Gender
Married
Dependents
Education
Self_Employed
ApplicantIncome
CoapplicantIncome
LoanAmount
Loan_Amount_Term
Credit_History
Property_Area
Loan_Status (target variable)

Data Preprocessing

Handling missing values:
- Mode imputation for categorical variables (Gender, Married, Dependents, Self_Employed, Credit_History)
- Median imputation for numerical variables (LoanAmount)
- Mode imputation for Loan_Amount_Term
Outlier treatment:
- Log transformation applied to LoanAmount to handle right skewness
Encoding categorical variables:
- One-hot encoding used for all categorical variables
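
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual code: the sample values, the `preprocess` helper, and the `LoanAmount_log` column name are assumptions, and only a few of the categorical columns are shown (the others would follow the same pattern).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample that reuses column names from the dataset.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "No", None, "Yes"],
    "Credit_History": [1.0, None, 1.0, 0.0],
    "Loan_Amount_Term": [360.0, 360.0, None, 180.0],
    "LoanAmount": [100.0, None, 200.0, 150.0],
})


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, log-transform, then one-hot encode."""
    out = frame.copy()

    # Mode imputation: fill gaps with the most frequent value of each column.
    for col in ["Gender", "Married", "Credit_History", "Loan_Amount_Term"]:
        out[col] = out[col].fillna(out[col].mode()[0])

    # Median imputation for the numerical LoanAmount column.
    out["LoanAmount"] = out["LoanAmount"].fillna(out["LoanAmount"].median())

    # Log transformation to reduce right skew (column name is an assumption).
    out["LoanAmount_log"] = np.log(out["LoanAmount"])

    # One-hot encode the string-valued categorical columns.
    return pd.get_dummies(out, columns=["Gender", "Married"])


clean = preprocess(df)
print(clean.head())
```

Imputing before encoding keeps missing values from turning into all-zero dummy rows.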

Exploratory Data Analysis

Univariate analysis:
- Distribution of loan approval status
- Distribution of categorical variables (Gender, Married, Dependents, Education, Self_Employed, Credit_History, Property_Area)
- Distribution of numerical variables (ApplicantIncome, CoapplicantIncome, LoanAmount)
Bivariate analysis:
- Relationship between categorical variables and loan approval status
- Relationship between numerical variables and loan approval status
Correlation analysis:
- Heatmap of correlation between numerical variables
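
The analyses listed above can be approximated with a few pandas calls. The sketch below uses a small invented sample (the values are not from the real dataset) and computes the tables that the plots would be drawn from; the plotting itself is left out.

```python
import pandas as pd

# Invented sample rows; only the column names come from the dataset.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    "ApplicantIncome": [5000, 3000, 2500, 6000, 1800, 4000],
    "CoapplicantIncome": [0, 1500, 0, 2000, 0, 1200],
    "LoanAmount": [130, 100, 70, 200, 60, 120],
})

# Univariate: share of approved vs. rejected loans.
status_share = df["Loan_Status"].value_counts(normalize=True)

# Bivariate: approval rate within each Credit_History group (rows sum to 1).
approval_by_credit = pd.crosstab(
    df["Credit_History"], df["Loan_Status"], normalize="index"
)

# Correlation matrix of the numerical variables, the input for a heatmap.
corr = df[["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]].corr()

print(status_share)
print(approval_by_credit)
print(corr.round(2))
```

A matrix like `corr` can be passed to a plotting function such as `seaborn.heatmap` to produce the heatmap, assuming seaborn is installed.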

Feature Engineering

Based on domain knowledge and insights from EDA, we created the following new features:

Total_Income: Sum of ApplicantIncome and CoapplicantIncome
Total_Income_log: Log transformation of Total_Income
EMI: LoanAmount divided by Loan_Amount_Term
Balance Income: Total_Income minus (EMI * 1000)
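
As a rough sketch, the engineered features could be computed as shown below. Several things here are assumptions rather than facts from the notebook: Total_Income is taken as the plain sum of the two income columns, the column is named `Balance_Income` with an underscore, LoanAmount is treated as being in thousands (which would explain the factor of 1000), and the sample values are invented.

```python
import numpy as np
import pandas as pd

# Two invented applicants; column names follow the dataset.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0, 1500],
    "LoanAmount": [120.0, 90.0],         # assumed to be in thousands
    "Loan_Amount_Term": [360.0, 180.0],  # assumed to be in months
})

# Total household income (assumed to be a simple sum).
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Log transformation to dampen the right skew of income.
df["Total_Income_log"] = np.log(df["Total_Income"])

# Approximate instalment per term unit, ignoring interest.
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]

# Income left after the instalment; EMI is rescaled by 1000 to match income units.
df["Balance_Income"] = df["Total_Income"] - df["EMI"] * 1000

print(df[["Total_Income", "EMI", "Balance_Income"]])
```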

Model Development

Four models were planned, of which two have been implemented and evaluated so far:

Logistic Regression
Decision Tree
Random Forest (Pending)
XGBoost (Pending)

Each implemented model was evaluated using 5-fold stratified cross-validation to obtain a more robust performance estimate.
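
A minimal sketch of 5-fold stratified cross-validation with scikit-learn is shown below. It runs on synthetic data from `make_classification` as a stand-in for the prepared loan features, and the shuffling, random seeds, and hyperparameters are assumptions, so the scores it prints will not match the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix and binary target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Stratified folds preserve the class ratio in every train/validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

metrics = ["accuracy", "f1", "roc_auc"]
summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    # Average each metric over the five validation folds.
    summary[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in metrics}

for name, result in summary.items():
    print(name, {m: round(v, 4) for m, v in result.items()})
```

Stratification matters here because approved and rejected loans are usually not evenly balanced, and unstratified folds could distort per-fold accuracy and F1.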

Results

The table below summarizes the performance of each model:

Model	Mean Validation Accuracy	Mean Validation F1 Score	AUC
Logistic Regression	0.7881	0.8622	0.8176
Decision Tree	0.7168	0.8027	0.7500
Random Forest	Not implemented	Not implemented	N/A
XGBoost	Not implemented	Not implemented	N/A

Note: The Random Forest and XGBoost models were mentioned in the project plan but not implemented in the provided notebook.

Installation

git clone https://github.com/riziuzi/loan-prediction.git
cd loan-prediction
pip install -r requirements.txt

Contributing

Contributions to this project are welcome. Please fork the repository and submit a pull request with your proposed changes.

Created by riziuzi

riziuzi/Loan-Predictor