Predicting credit risk when a person requests a loan, using a random forest on the South German Credit dataset (with fixes for imbalanced data)

Credit-Risk-Prediction

Credit risk refers to the possibility of a loss arising from a borrower's failure to repay a loan or meet contractual obligations. Conventionally, it is the risk that a lender may not receive the owed interest and principal, which disrupts cash flows and increases collection costs. This project applies a random forest to the South German Credit data, using over-sampling and under-sampling methods to correct the class imbalance. The aim is to predict credit risk when a person requests a loan: build a model that predicts whether the person, described by the attributes of this dataset, is a good (1) or a bad (0) credit risk.

Data Documentation

The South German Credit dataset contains 1000 samples and 21 attributes. A brief description of each attribute follows:

Categorical features:

status

status of the debtor's checking account with the bank

  • 1 : no checking account
  • 2 : ... < 0 DM
  • 3 : 0 <= ... < 200 DM
  • 4 : ... >= 200 DM / salary for at least 1 year

credit_history

history of compliance with previous or concurrent credit contracts

  • 0 : delay in paying off in the past
  • 1 : critical account/other credits elsewhere
  • 2 : no credits taken/all credits paid back duly
  • 3 : existing credits paid back duly till now
  • 4 : all credits at this bank paid back duly

purpose

purpose for which the credit is needed

  • 0 : others
  • 1 : car (new)
  • 2 : car (used)
  • 3 : furniture/equipment
  • 4 : radio/television
  • 5 : domestic appliances
  • 6 : repairs
  • 7 : education
  • 8 : vacation
  • 9 : retraining
  • 10 : business

savings

debtor's savings

  • 1 : unknown/no savings account
  • 2 : ... < 100 DM
  • 3 : 100 <= ... < 500 DM
  • 4 : 500 <= ... < 1000 DM
  • 5 : ... >= 1000 DM

personal_status

combined information on sex and marital status

  • 1 : male : divorced/separated
  • 2 : female : non-single or male : single
  • 3 : male : married/widowed
  • 4 : female : single

other_debtors

Is there another debtor or a guarantor for the credit?

  • 1 : none
  • 2 : co-applicant
  • 3 : guarantor

other_installment_plans

installment plans from providers other than the credit-giving bank

  • 1 : bank
  • 2 : stores
  • 3 : none

housing

type of housing the debtor lives in

  • 1 : for free
  • 2 : rent
  • 3 : own

Binary features:

people_liable

number of persons who financially depend on the debtor (i.e., are entitled to maintenance)

  • 1 : 3 or more
  • 2 : 0 to 2

telephone

Is there a telephone landline registered on the debtor's name?

  • 1 : no
  • 2 : yes (under customer name)

foreign_worker

Is the debtor a foreign worker?

  • 1 : yes
  • 2 : no

Ordinal features:

employment_duration

duration of debtor's employment with current employer

  • 1 : unemployed
  • 2 : < 1 yr
  • 3 : 1 <= ... < 4 yrs
  • 4 : 4 <= ... < 7 yrs
  • 5 : >= 7 yrs

installment_rate

credit installments as a percentage of debtor's disposable income

  • 1 : >= 35
  • 2 : 25 <= ... < 35
  • 3 : 20 <= ... < 25
  • 4 : < 20

present_residence

length of time (in years) the debtor lives in the present residence

  • 1 : < 1 yr
  • 2 : 1 <= ... < 4 yrs
  • 3 : 4 <= ... < 7 yrs
  • 4 : >= 7 yrs

property

the debtor's most valuable property, i.e. the highest applicable code is used. Code 2 is used if codes 3 or 4 do not apply and there is a car or other relevant property that is not covered by the savings variable (sparkont in the original German data).

  • 1 : unknown / no property
  • 2 : car or other
  • 3 : building soc. savings agr./life insurance
  • 4 : real estate

number_credits

number of credits including the current one the debtor has (or had) at this bank

  • 1 : 1
  • 2 : 2-3
  • 3 : 4-5
  • 4 : >= 6

job

quality of debtor's job

  • 1 : unemployed/unskilled - non-resident
  • 2 : unskilled - resident
  • 3 : skilled employee/official
  • 4 : manager/self-empl./highly qualif. employee

Continuous features:

duration

credit duration in months

amount

credit amount in DM

age

age in years

Class labels:

credit_risk

Has the credit contract been complied with (good) or not (bad)? (binary class)

  • 0 : bad
  • 1 : good

For more information please read the data documentation.
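Because every categorical and ordinal feature is stored as an integer code, it can help to map the codes back to their labels before plotting or inspecting rows. The sketch below shows one way to do this with pandas for two of the columns described above. The dictionary names and the three sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the attribute descriptions above.
# The dictionary names are illustrative.
STATUS_LABELS = {
    1: "no checking account",
    2: "... < 0 DM",
    3: "0 <= ... < 200 DM",
    4: "... >= 200 DM / salary for at least 1 year",
}
CREDIT_RISK_LABELS = {0: "bad", 1: "good"}

# Made-up rows standing in for the real dataset.
coded = pd.DataFrame({"status": [1, 4, 2], "credit_risk": [1, 1, 0]})

# Replace the integer codes with readable labels in a copy of the frame.
decoded = coded.assign(
    status=coded["status"].map(STATUS_LABELS),
    credit_risk=coded["credit_risk"].map(CREDIT_RISK_LABELS),
)
print(decoded)
```

The coded frame is left untouched, so the integer codes remain available for model training.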

Code Explanation

The code analyzes the credit risk data, visualizes relationships between features and class labels, trains a classifier, and addresses the class imbalance. It uses Python libraries such as pandas, numpy, seaborn, and scikit-learn. The goal is to predict credit risk from the features in the dataset.

Importing Libraries

  • The code begins by importing necessary libraries such as pandas and numpy for data manipulation, and matplotlib and seaborn for visualization.

Loading Data

  • The credit risk dataset is read from a file named "Data.asc" using pd.read_csv().
  • The shape of the loaded dataframe is printed.
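The loading step can be sketched as follows. This is a minimal example that assumes "Data.asc" is whitespace-delimited with a header row. The separator, the helper name load_credit_data, and the in-memory sample rows are assumptions made so the snippet runs without the file.

```python
import io

import pandas as pd


def load_credit_data(source, sep=r"\s+"):
    """Read the credit data from a path or a file-like object.

    The whitespace separator is an assumption; change `sep` if your copy
    of Data.asc uses a different delimiter.
    """
    return pd.read_csv(source, sep=sep)


# Tiny in-memory stand-in for "Data.asc" (made-up rows).
sample = io.StringIO(
    "status duration credit_history amount credit_risk\n"
    "1 18 4 1049 1\n"
    "2 9 4 2799 1\n"
    "1 12 2 841 0\n"
)

credit_data = load_credit_data(sample)
print(credit_data.shape)  # (3, 5) for this three-row sample
```

With the real file in the working directory, the call would be load_credit_data("Data.asc").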

Data Exploration

  • The top few rows of the dataset are displayed using credit_data.head().
  • A concise summary of the dataframe is printed using credit_data.info(), showing non-null counts and data types for each column.
  • A statistical summary of the data is displayed using credit_data.describe().
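The three exploration calls above might look like this on a small frame. The rows are made up for illustration. The value_counts() check at the end is an addition not mentioned in the original steps, included because it is a quick way to see the class imbalance.

```python
import io

import pandas as pd

# Made-up rows standing in for the loaded dataset.
credit_data = pd.DataFrame(
    {
        "duration": [18, 9, 12, 24, 36],
        "amount": [1049, 2799, 841, 2122, 6000],
        "age": [21, 36, 23, 39, 45],
        "credit_risk": [1, 1, 0, 1, 0],
    }
)

# First rows of the dataset.
print(credit_data.head())

# info() writes to stdout by default; a buffer captures the text instead.
buffer = io.StringIO()
credit_data.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = credit_data.describe()
print(summary)

# Extra check (not in the original steps): samples per class label.
class_counts = credit_data["credit_risk"].value_counts()
print(class_counts)
```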

Exploratory Data Analysis (EDA)

  • Visualizations are created to explore relationships between features and class labels.
  • Point plots show relationships between features like duration, amount, and employment duration for each class label.
  • Count plots display the distribution of credit history for each class label.
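The plots described above could be produced along these lines with seaborn. This is a sketch on randomly generated stand-in data. The choice of axes, the helper name plot_eda, and the figure layout are assumptions, not the notebook's exact plots.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data using the column names from the data documentation.
rng = np.random.default_rng(0)
n = 200
credit_data = pd.DataFrame(
    {
        "duration": rng.integers(4, 73, size=n),
        "amount": rng.integers(250, 18000, size=n),
        "employment_duration": rng.integers(1, 6, size=n),
        "credit_history": rng.integers(0, 5, size=n),
        "credit_risk": rng.choice([0, 1], size=n, p=[0.3, 0.7]),
    }
)


def plot_eda(df):
    """Draw two point plots and one count plot, split by class label."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    # Mean credit duration per employment category, one line per class.
    sns.pointplot(data=df, x="employment_duration", y="duration",
                  hue="credit_risk", ax=axes[0])
    # Mean credit amount per employment category, one line per class.
    sns.pointplot(data=df, x="employment_duration", y="amount",
                  hue="credit_risk", ax=axes[1])
    # Number of samples per credit history code, one bar per class.
    sns.countplot(data=df, x="credit_history", hue="credit_risk", ax=axes[2])
    fig.tight_layout()
    return fig, axes


fig, axes = plot_eda(credit_data)
```

In a notebook the figure renders inline; in a script, fig.savefig(...) writes it to disk.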

Model Design and Evaluation

  • A random forest classifier is trained on the data.
  • Accuracy scores are computed for both training and test sets.
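A minimal sketch of this step with scikit-learn is shown below. The data is synthetic, and the 80/20 stratified split, the 100 trees, and the random seeds are assumptions because the original settings are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features; the real notebook uses the full dataset.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame(
    {
        "duration": rng.integers(4, 73, size=n),
        "amount": rng.integers(250, 18000, size=n),
        "age": rng.integers(19, 76, size=n),
    }
)
# The label depends loosely on duration so the model has a signal to learn.
y = (X["duration"] + rng.normal(0, 10, size=n) < 45).astype(int)
y = y.rename("credit_risk")

# Hold out 20% of the rows, keeping the class ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A large gap between the two scores suggests overfitting. On imbalanced data, accuracy alone can also hide poor performance on the minority class, which is why the next step looks at per-class metrics.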

Handling Imbalanced Data

  • Techniques to address class imbalance are applied (oversampling and undersampling).
  • The performance metrics (precision, recall, F1-score) are evaluated for both class labels.
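One way to sketch both resampling strategies is random over-sampling and random under-sampling with sklearn.utils.resample, applied to the training split only. The original notebook's exact resampling method is not specified here, so this is a stand-in technique. The helper name rebalance and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (roughly three "good" per "bad").
rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame(
    {
        "duration": rng.integers(4, 73, size=n),
        "amount": rng.integers(250, 18000, size=n),
    }
)
data["credit_risk"] = (
    data["duration"] + rng.normal(0, 12, size=n) < 55
).astype(int)

train, test = train_test_split(
    data, test_size=0.25, stratify=data["credit_risk"], random_state=0
)


def rebalance(train_df, target="credit_risk", mode="over", seed=0):
    """Randomly over- or under-sample so every class has the same size."""
    counts = train_df[target].value_counts()
    n_target = int(counts.max() if mode == "over" else counts.min())
    parts = []
    for _, group in train_df.groupby(target):
        if len(group) == n_target:
            parts.append(group)
        else:
            # Over-sampling draws with replacement, under-sampling without.
            parts.append(
                resample(group, replace=(mode == "over"),
                         n_samples=n_target, random_state=seed)
            )
    return pd.concat(parts)


reports = {}
for mode in ("over", "under"):
    balanced = rebalance(train, mode=mode)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(balanced.drop(columns="credit_risk"), balanced["credit_risk"])
    predictions = model.predict(test.drop(columns="credit_risk"))
    # Precision, recall, and F1-score per class label, as a nested dict.
    reports[mode] = classification_report(
        test["credit_risk"], predictions, output_dict=True, zero_division=0
    )
    print(mode, "recall on bad (0):", round(reports[mode]["0"]["recall"], 3))
```

Resampling only the training rows keeps the test set at its original class ratio, so the reported precision and recall reflect the distribution the model would face in practice.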