Home-Credit-Indonesia-Score-Card

Project-based internship as a data scientist at Home Credit Indonesia, in cooperation with Rakamin Academy.

Primary language: Jupyter Notebook

Business Problem:

  1. Home Credit Indonesia wants to build a machine learning model that helps the team predict whether a customer's loan application will run into problems during credit repayment.
  2. From the existing data, Home Credit Indonesia wants to identify the characteristics of customers who have no problems in the credit repayment process, to help increase revenue.

Business Insight:

image

Accountants are the largest group of customers, so we should run a campaign to thank them. We should also run promotions for the other jobs in the graph (HR staff, IT staff, and realty agents): the percentage of customers in these jobs who have no problem repaying their loans is quite high, but the number of customers from these three jobs is still small.

Machine Learning Model:

  1. Replace unknown data

image

XNA is probably a null value: a data-entry error likely caused missing values to be recorded as XNA.
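A minimal sketch of this step, assuming the placeholder appears as the literal string `"XNA"` in categorical columns. The toy data and the choice of columns are illustrative assumptions, not taken from the notebook. The sketch only converts `XNA` to a proper missing value; how the missing values are then filled or dropped is a separate decision.

```python
import numpy as np
import pandas as pd

# Toy frame; the columns chosen here are assumed examples of where "XNA" may appear.
df = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "XNA", "F"],
    "ORGANIZATION_TYPE": ["School", "XNA", "Bank", "XNA"],
})

# Treat the "XNA" placeholder as a real missing value.
df = df.replace("XNA", np.nan)

# Count missing values per column to see how much data was affected.
missing_per_column = df.isna().sum()
print(missing_per_column)
```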

  2. Make a new column

image

I added this column because I think it is influential in determining whether a customer has a problem repaying or not.

  3. Feature Engineering

3A. Numeric Feature

A. Remove outlier

image

I removed outliers in CNT_CHILDREN because I consider more than 7 children to be abnormal data.
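The filter described here can be sketched as follows. The toy values are made up for illustration; only the column name `CNT_CHILDREN` and the threshold of 7 come from the text.

```python
import pandas as pd

# Toy data standing in for the real application table.
df = pd.DataFrame({"CNT_CHILDREN": [0, 1, 2, 7, 8, 19]})

MAX_CHILDREN = 7  # rows above this value are treated as outliers

# Keep only rows at or below the threshold.
filtered = df[df["CNT_CHILDREN"] <= MAX_CHILDREN].reset_index(drop=True)

print(len(df) - len(filtered), "rows removed")
```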

B. Normalize Data

image

image

I divide the numeric columns into two groups: columns with only 2 unique values (1 and 0) and columns with more than 2 unique values. I do this because the columns with 2 unique values represent true/false flags, so I only normalize the numeric columns with more than 2 unique values.
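A rough sketch of this split, assuming min-max normalization; the text does not say which scaler the notebook uses, so that choice and the toy data are assumptions.

```python
import pandas as pd

# Toy data: one binary flag and two columns with more than 2 unique values.
df = pd.DataFrame({
    "FLAG_EMAIL": [0, 1, 1, 0],
    "AMT_INCOME_TOTAL": [100.0, 200.0, 300.0, 500.0],
    "CNT_CHILDREN": [0, 1, 2, 4],
})

numeric_cols = df.select_dtypes(include="number").columns

# Flags (at most 2 unique values) are left untouched; the rest are scaled.
binary_cols = [c for c in numeric_cols if df[c].nunique() <= 2]
scale_cols = [c for c in numeric_cols if df[c].nunique() > 2]

# Min-max normalization: (x - min) / (max - min) maps each column to [0, 1].
for col in scale_cols:
    col_min = df[col].min()
    col_max = df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)

print(df)
```

Because every scaled column has more than 2 unique values, `col_max - col_min` is never zero.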

3B. Object Feature

image

I divide the object columns into 2 groups: columns with more than 2 unique values are one-hot encoded, and columns with exactly 2 unique values are label encoded.

A. One hot encoding

image

B. Label Encoding

image
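The two encodings in 3B can be sketched with pandas alone. The toy data is illustrative, and the notebook may use different tools (for example scikit-learn encoders); this is only one way to get the same kind of result.

```python
import pandas as pd

# Toy data: one column with 2 unique values, one with more than 2.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "NAME_INCOME_TYPE": ["Working", "Student", "Pensioner"],
})

# Non-numeric columns are treated as categorical features.
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
label_cols = [c for c in cat_cols if df[c].nunique() == 2]
onehot_cols = [c for c in cat_cols if df[c].nunique() > 2]

encoded = df.copy()

# Label encoding: map the two categories to 0 and 1 (in sorted order).
for col in label_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(encoded, columns=onehot_cols, dtype=int)

print(encoded)
```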

  4. Oversampling and undersampling

Because the data is imbalanced, we need to balance it with oversampling or undersampling so that the model achieves better accuracy.

image
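As an illustration of the oversampling idea, here is a minimal sketch of random oversampling using only pandas. The toy data and the `TARGET` label name are assumptions for this example, and the notebook may rely on a dedicated library or a different technique.

```python
import pandas as pd

# Toy imbalanced data: 8 rows of class 0 and 2 rows of class 1.
df = pd.DataFrame({
    "AMT_CREDIT": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "TARGET":     [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

counts = df["TARGET"].value_counts()
majority = df[df["TARGET"] == counts.idxmax()]
minority = df[df["TARGET"] == counts.idxmin()]

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of rows.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["TARGET"].value_counts())
```

Resampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.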

  5. Train Model

A. Logistic Regression

image

B. XGBoost

image

C. Random Forest

image

  6. Model results

image

Random forest was the best model in this case because it had the highest accuracy, precision, recall, and F1-score of the three models.
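A comparison like this can be sketched with scikit-learn as shown below. The data is synthetic and the hyperparameters are arbitrary assumptions, so the scores it prints say nothing about the real results. XGBoost is left out to keep the dependencies small; its `XGBClassifier` follows the same `fit`/`predict` interface and could be added to the `models` dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the prepared features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and collect the four metrics used in the comparison.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```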

Summary

  1. For revolving loan contracts, we should target customers whose income type is businessman, maternity leave, student, or unemployed.

  2. It’s recommended to create campaigns for customers who work as accountants, because accountants are the largest customer group and have the highest percentage of successful payments.

  3. Create advertisements or promotions that encourage HR staff, IT staff, and realty agents to apply for credit.

  4. Random forest is the machine learning model chosen to help the team determine whether a customer will have a problem paying off a loan or not.