This project builds a predictive model that classifies a person as a good or bad payer based on that person's financial data.
- Developed a tool to predict whether a customer is likely to be a good or bad payer with a minimum accuracy of 75%.
- Evaluated and compared the performance of two models, logistic regression and naive Bayes, to determine the best model for the task.
- Performed data cleaning and preprocessing to ensure the dataset is suitable for exploratory data analysis (EDA) and model training.
- Conducted an EDA to gain insights into the dataset, including analyzing the distribution of various attributes and their relationships with the target variable.
- Visualized the distributions of selected attributes using appropriate charts and plots.
Python Version: 3.10.8
Packages: pandas, numpy, matplotlib, seaborn, scikit-learn (sklearn)
The dataset contains 1000 records and is available at this link: financial_data.
It includes the following attributes:
- Status of existing checking account (qualitative)
- Duration in months (numerical)
- Credit history (qualitative)
- Purpose (qualitative)
- Credit amount (numerical)
- Savings account/bonds (qualitative)
- Present employment since (qualitative)
- Installment rate in percentage of disposable income (numerical)
- Personal status and sex (qualitative)
- Other debtors/guarantors (qualitative)
- Present residence since (numerical)
- Property (qualitative)
- Age in years (numerical)
- Other installment plans (qualitative)
- Housing (qualitative)
- Number of existing credits at this bank (numerical)
- Job (qualitative)
- Number of people being liable to provide maintenance for (numerical)
- Telephone (qualitative)
- Foreign worker (qualitative)
- Response variable: 1 (bad) or 2 (good)
Few changes were needed. I did the following:
- Checked for null values
- Replaced the coded attribute values with their actual meanings for the EDA
- Converted the column data types to int for the models
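The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the coded values (`A11`, `A12`, `A14`), and the label mapping are assumptions made for the example, and the tiny DataFrame stands in for the real file.

```python
import pandas as pd

# Hypothetical miniature of the dataset. Column names and coded values
# are illustrative assumptions, not taken from the real file.
df = pd.DataFrame({
    "checking_status": ["A11", "A12", "A14", "A11"],
    "duration_months": [6, 48, 12, 42],
    "credit_amount": [1169, 5951, 2096, 7882],
    "target": [1, 2, 1, 1],
})

# Step 1: count null values in every column.
null_counts = df.isnull().sum()
print(null_counts)

# Step 2: swap coded values for readable labels, used only for the EDA.
# The mapping below is an assumed example.
checking_labels = {
    "A11": "< 0 DM",
    "A12": "0 to 200 DM",
    "A14": "no checking account",
}
eda_df = df.copy()
eda_df["checking_status"] = eda_df["checking_status"].map(checking_labels)

# Step 3: turn the categorical column into integer codes for the models.
model_df = df.copy()
model_df["checking_status"] = (
    model_df["checking_status"].astype("category").cat.codes.astype("int64")
)
print(model_df.dtypes)
```

Keeping a labelled copy for the EDA and an integer-coded copy for the models is one possible design; the codes assigned by `cat.codes` follow the sorted order of the categories.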
Some findings from the EDA:
- 70% of the dataset consists of bad payers (700 rows).
- Most of the clients have paid back their existing credits so far.
- The credit amount follows a roughly exponential distribution, with its peak around 2000 DM.
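Findings like the ones above could be computed along these lines. The data here is synthetic (randomly generated) and the column names `credit_amount` and `target` are assumptions, so the printed numbers will not match the real dataset; the sketch only shows the kind of calculation involved.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: NOT the real dataset, values are random.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "credit_amount": rng.exponential(scale=3000, size=1000).round(),
    "target": rng.choice([1, 2], size=1000, p=[0.7, 0.3]),
})

# Share of each class in the response variable.
class_share = df["target"].value_counts(normalize=True).sort_index()
print(class_share)

# Bin the credit amounts and find the most populated bin,
# i.e. where the distribution peaks.
counts, edges = np.histogram(df["credit_amount"], bins=20)
peak_bin = counts.argmax()
print(f"Most common range: {edges[peak_bin]:.0f} to {edges[peak_bin + 1]:.0f} DM")
```

The same binned counts can be drawn as a chart with matplotlib's `plt.hist` on the `credit_amount` column.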
First, I converted the columns to int type so they could be fed to the models. Then I split the dataset into training and test sets.
I tried two different models:
- logistic regression, accuracy: 75%
- Naive Bayes, accuracy: 74%
So logistic regression is slightly better in terms of accuracy, and it meets the 75% minimum accuracy target.
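A comparison of this kind might look like the sketch below. Several details are assumptions, since the text does not state them: the data is synthetic (generated with `make_classification`), the test split is 20%, and `GaussianNB` is used as the naive Bayes variant. The accuracies it prints are therefore unrelated to the 75% and 74% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the integer-encoded dataset (assumption):
# 1000 rows, 20 features, roughly a 70/30 class balance.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)

# Hold out 20% of the rows for testing (assumed split size),
# keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),  # assumed variant
}

# Fit each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")

best = max(scores, key=scores.get)
print(f"Best by accuracy: {best}")
```

With an imbalanced target, accuracy alone can be flattering, so checking a confusion matrix alongside it is a reasonable addition.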