ML-based-feature-selection

Feature Selection - Classification

Code pipeline for applying multiple feature selection methods to classification problems



Contents

  1. Purpose
  2. Steps
  3. Data
  4. Instructions

Purpose:

This notebook is a one-stop shop for ML-based feature selection for classification problems. Once the master dataset is prepared, the notebook can be used to try out eight feature selection techniques and combine their results.

Steps:

Step 1: Load Package and Custom Functions

  1. Install/import necessary packages
  2. Load custom functions
  3. Input user-defined metrics

Step 2: Data Preparation

  1. Import dataset
  2. Profiling and EDA
  3. Drop unwanted fields and convert column types
  4. Null Treatment
  5. Treating categorical features (one hot encoding, WOE encoding)
  6. Train/Test/OOT Split
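The data preparation steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's actual code: the column names (`limit_bal`, `education`, `snapshot_month`, `default`), the median imputation, and the 80/20 split ratio are all assumptions. The sketch also imputes nulls after the split rather than before it, so that the fill value is learned from the training rows only.

```python
import numpy as np
import pandas as pd

# Toy data: every column name and value below is hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "limit_bal": rng.integers(10_000, 500_000, size=n).astype(float),
    "education": rng.choice(["graduate", "university", "high_school"], size=n),
    "snapshot_month": rng.choice(["2005-07", "2005-08", "2005-09"], size=n),
    "default": rng.integers(0, 2, size=n),
})
df.loc[:9, "limit_bal"] = np.nan  # inject a few nulls to treat later

# One-hot encode the categorical column; drop_first avoids a redundant level.
df = pd.get_dummies(df, columns=["education"], drop_first=True)

# Out-of-time (OOT) split: hold out the most recent period entirely.
is_oot = df["snapshot_month"] == "2005-09"
oot, in_time = df[is_oot], df[~is_oot]

# Random 80/20 train/test split of the in-time rows.
train = in_time.sample(frac=0.8, random_state=42)
test = in_time.drop(train.index)

# Null treatment: impute with the training median so that test and OOT
# rows do not leak into the fill value.
median = train["limit_bal"].median()
train, test, oot = (
    part.fillna({"limit_bal": median}) for part in (train, test, oot)
)
```

An OOT sample only makes sense when the data carries a time dimension; for a single-snapshot dataset, a plain train/test split may be all that is available.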

Step 3: Feature Selection

Methods:

  1. Correlation: Pearson, Point-Biserial, Cramér's V
  2. Weight of Evidence (WOE) and Information Value (IV)
  3. Beta Coefficients
  4. Lasso Regression
  5. Recursive Feature Elimination (RFE)
  6. Sequential Feature Selector
  7. BorutaPy
  8. BorutaShap
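As an illustration of one of the methods listed above, Weight of Evidence and Information Value can be computed for a numeric feature along the following lines. This is a sketch under stated assumptions rather than the notebook's own implementation: the function name `woe_iv`, the quantile binning, the smoothing constant `eps`, and the sign convention (WOE = ln of non-event share over event share) are all choices made here for the example.

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, bins: int = 5, eps: float = 0.5):
    """Return (per-bin WOE table, information value) for a numeric feature.

    target is binary: 1 = event (e.g. default), 0 = non-event.
    eps is a smoothing count added to every cell so that a bin with no
    events (or no non-events) does not produce log(0).
    """
    # Quantile bins; duplicates="drop" merges bins that share an edge.
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    grouped = target.groupby(binned, observed=True).agg(["sum", "count"])
    events = grouped["sum"] + eps
    non_events = grouped["count"] - grouped["sum"] + eps

    # Share of all events / non-events that falls into each bin.
    event_share = events / events.sum()
    non_event_share = non_events / non_events.sum()

    woe = np.log(non_event_share / event_share)
    iv = float(((non_event_share - event_share) * woe).sum())
    table = pd.DataFrame({"event_share": event_share,
                          "non_event_share": non_event_share,
                          "woe": woe})
    return table, iv

# Toy check: a feature that drives the target versus pure noise.
rng = np.random.default_rng(0)
n = 2000
informative = pd.Series(rng.normal(size=n))
noise = pd.Series(rng.normal(size=n))
prob = 1 / (1 + np.exp(-2 * informative))
y = pd.Series((rng.random(n) < prob).astype(int))

_, iv_informative = woe_iv(informative, y)
_, iv_noise = woe_iv(noise, y)
```

Features would then typically be ranked by IV and kept if they clear a user-chosen threshold; the threshold itself is a modelling decision and is not fixed by this sketch.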

Features selected by a majority of the methods are picked for modelling. Users can run all of the methods or only a subset of them.
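The voting step can be sketched as below. The function name `majority_vote`, the method names, and the feature names are hypothetical, and "majority" is assumed here to mean strictly more than half of the methods that were run; the notebook may use a different rule.

```python
from collections import Counter

def majority_vote(selections: dict[str, set[str]]) -> list[str]:
    """Keep features chosen by more than half of the methods that were run."""
    votes = Counter(
        feature for chosen in selections.values() for feature in chosen
    )
    threshold = len(selections) / 2
    return sorted(f for f, count in votes.items() if count > threshold)

# Hypothetical output of three methods (any subset of the eight works).
selections = {
    "information_value": {"limit_bal", "pay_status_sep", "age"},
    "lasso": {"limit_bal", "pay_status_sep"},
    "rfe": {"pay_status_sep", "bill_amt_sep"},
}
selected = majority_vote(selections)
```

Because the threshold is derived from the number of methods supplied, the same function works whether all eight methods or only a few are run.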

Data

We use the "default of credit card clients" dataset. It contains 30,000 records on customer default payments in Taiwan and has 23 features:

  1. Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
  2. Gender (1 = male; 2 = female).
  3. Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  4. Marital status (1 = married; 2 = single; 3 = others).
  5. Age (year).
  6. History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  7. Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
  8. Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
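The eight items above expand to 23 features because items 6 to 8 each cover six months. A small sketch of that mapping is shown below; the descriptive names on the right-hand side are invented for illustration and are not the dataset's actual column headers.

```python
# Months covered by the monthly features, most recent first (all in 2005).
months = ["sep", "aug", "jul", "jun", "may", "apr"]

# X1..X5 are single-valued features (hypothetical descriptive names).
names = {
    "X1": "credit_limit",
    "X2": "gender",
    "X3": "education",
    "X4": "marital_status",
    "X5": "age",
}

# X6..X11, X12..X17 and X18..X23 each run from September back to April.
for start, prefix in [(6, "repay_status"), (12, "bill_amount"), (18, "paid_amount")]:
    for offset, month in enumerate(months):
        names[f"X{start + offset}"] = f"{prefix}_{month}"
```

A mapping like this could be passed to pandas `DataFrame.rename(columns=names)` if the data is loaded with the generic `X1` to `X23` headers.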

Instructions

Users only need to update the manual inputs section; the rest of the notebook should then run without further changes.