
Feature-Engineering-Handbook

Welcome! This repo provides an interactive, complete, and practical feature engineering tutorial in Jupyter Notebook. It contains three parts: Data Preprocessing, Feature Selection, and Dimension Reduction, each demonstrated in its own notebook. Since some feature selection algorithms, such as Simulated Annealing and Genetic Algorithm, lack a complete implementation in Python, we also provide the corresponding Python scripts (Simulated Annealing, Genetic Algorithm) and cover them in the tutorial for your reference.
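
To give a taste of how such a wrapper method works, here is a minimal, self-contained simulated annealing sketch. It is not the repo's actual script; the dataset, model, and hyperparameters are illustrative choices. A feature mask is perturbed one bit at a time, improvements are always kept, and worse subsets are occasionally accepted with a temperature-dependent probability.

```python
# Minimal simulated annealing feature-selection sketch (illustrative only;
# the repo's Simulated Annealing script may differ in interface and details).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def subset_score(mask):
    # Cross-validated accuracy of the candidate feature subset.
    if not mask.any():
        return 0.0
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

mask = rng.random(X.shape[1]) < 0.5            # random initial subset
current = subset_score(mask)
best_mask, best = mask.copy(), current
T = 1.0                                        # initial temperature
for _ in range(50):
    candidate = mask.copy()
    candidate[rng.integers(X.shape[1])] ^= True  # flip one feature in/out
    s = subset_score(candidate)
    # Always accept improvements; accept worse subsets with
    # probability exp((s - current) / T).
    if s > current or rng.random() < np.exp((s - current) / T):
        mask, current = candidate, s
        if s > best:
            best_mask, best = candidate.copy(), s
    T *= 0.9                                   # geometric cooling schedule

print(f"best CV accuracy {best:.3f} with {int(best_mask.sum())} features")
```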

Brief Introduction

Table of Contents

(Short usage sketches of each part follow the table of contents.)

  • 1  Data Preprocessing
    • 1.1  Static Continuous Variables
      • 1.1.1  Discretization
        • 1.1.1.1  Binarization
        • 1.1.1.2  Binning
      • 1.1.2  Scaling
        • 1.1.2.1  Standard Scaling (Z-score standardization)
        • 1.1.2.2  MinMaxScaler (Scale to range)
        • 1.1.2.3  RobustScaler (Outlier-robust scaling)
        • 1.1.2.4  Power Transform (Non-linear transformation)
      • 1.1.3  Normalization
      • 1.1.4  Imputation of missing values
        • 1.1.4.1  Univariate feature imputation
        • 1.1.4.2  Multivariate feature imputation
        • 1.1.4.3  Marking imputed values
      • 1.1.5  Feature Transformation
        • 1.1.5.1  Polynomial Transformation
        • 1.1.5.2  Custom Transformation
    • 1.2  Static Categorical Variables
      • 1.2.1  Ordinal Encoding
      • 1.2.2  One-hot Encoding
      • 1.2.3  Hashing Encoding
      • 1.2.4  Helmert Coding
      • 1.2.5  Sum (Deviation) Coding
      • 1.2.6  Target Encoding
      • 1.2.7  M-estimate Encoding
      • 1.2.8  James-Stein Encoder
      • 1.2.9  Weight of Evidence Encoder
      • 1.2.10  Leave One Out Encoder
      • 1.2.11  CatBoost Encoder
    • 1.3  Time Series Variables
      • 1.3.1  Time Series Categorical Features
      • 1.3.2  Time Series Continuous Features
      • 1.3.3  Implementation
        • 1.3.3.1  Create EntitySet
        • 1.3.3.2  Set up Cutoff Time
        • 1.3.3.3  Auto Feature Engineering
  • 2  Feature Selection
    • 2.1  Filter Methods
      • 2.1.1  Univariate Filter Methods
        • 2.1.1.1  Variance Threshold
        • 2.1.1.2  Pearson Correlation (regression problem)
        • 2.1.1.3  Distance Correlation (regression problem)
        • 2.1.1.4  F-Score (regression problem)
        • 2.1.1.5  Mutual Information (regression problem)
        • 2.1.1.6  Chi-squared Statistics (classification problem)
        • 2.1.1.7  F-Score (classification problem)
        • 2.1.1.8  Mutual Information (classification problem)
      • 2.1.2  Multivariate Filter Methods
        • 2.1.2.1  Max-Relevance Min-Redundancy (mRMR)
        • 2.1.2.2  Correlation-based Feature Selection (CFS)
        • 2.1.2.3  Fast Correlation-based Filter (FCBF)
        • 2.1.2.4  ReliefF
        • 2.1.2.5  Spectral Feature Selection (SPEC)
    • 2.2  Wrapper Methods
      • 2.2.1  Deterministic Algorithms
        • 2.2.1.1  Recursive Feature Elimination (RFE)
      • 2.2.2  Randomized Algorithms
        • 2.2.2.1  Simulated Annealing (SA)
        • 2.2.2.2  Genetic Algorithm (GA)
    • 2.3  Embedded Methods
      • 2.3.1  Regularization-Based Methods
        • 2.3.1.1  Lasso Regression (Linear Regression with L1 Norm)
        • 2.3.1.2  Logistic Regression (with L1 Norm)
        • 2.3.1.3  LinearSVR / LinearSVC
      • 2.3.2  Tree-Based Methods
  • 3  Dimension Reduction
    • 3.1  Unsupervised Methods
      • 3.1.1  PCA (Principal Component Analysis)
    • 3.2  Supervised Methods
      • 3.2.1  LDA (Linear Discriminant Analysis)
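
As a quick orientation before opening the notebooks, here is Part 1 in miniature: univariate imputation (1.1.4), standard scaling (1.1.2), and one-hot encoding (1.2.2) composed with scikit-learn. The DataFrame and column names below are toy values for illustration only.

```python
# Part 1 sketch: impute, scale, and encode with a scikit-learn ColumnTransformer.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0],
    "income": [48000.0, 52000.0, np.nan, 61000.0],
    "city": ["NY", "SF", "NY", np.nan],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the median
    ("scale", StandardScaler()),                   # z-score standardization
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # sparse_output=False requires scikit-learn >= 1.2
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

pre = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])
print(pre.fit_transform(df))
```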
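Section 1.3's workflow (create an EntitySet, set a cutoff time, run automated feature engineering) is built on featuretools. The sketch below assumes the featuretools 1.x API and invents a two-table toy dataset; deep feature synthesis then aggregates child transactions up to each customer, using only rows before the cutoff time.

```python
# Section 1.3 sketch: automated feature engineering with featuretools
# (assumes the featuretools 1.x API; toy transaction data made up here).
import pandas as pd
import featuretools as ft

customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [20.0, 35.0, 12.0, 50.0],
    "time": pd.to_datetime(["2020-01-01", "2020-01-03",
                            "2020-01-02", "2020-01-05"]),
})

# 1.3.3.1 Create EntitySet: register both tables and link them.
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers,
                      index="customer_id")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="time")
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# 1.3.3.2 Set up cutoff time: only rows before each cutoff may feed a feature.
cutoff = pd.DataFrame({"customer_id": [1, 2],
                       "time": pd.to_datetime(["2020-01-04", "2020-01-04"])})

# 1.3.3.3 Auto feature engineering: deep feature synthesis builds aggregates
# such as MEAN(transactions.amount) per customer.
features, defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                        cutoff_time=cutoff)
print(features.head())
```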
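Part 2 in miniature: a univariate filter (mutual information, 2.1.1.8) chained with an embedded L1-based selector (2.3.1.2) on synthetic data. The values of k and C are arbitrary illustration choices, not recommendations.

```python
# Part 2 sketch: filter then embedded selection on a toy classification task.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# Filter: keep the 15 features with the highest mutual information with y.
X_filtered = SelectKBest(mutual_info_classif, k=15).fit_transform(X, y)

# Embedded: L1-penalized logistic regression zeroes out weak coefficients,
# and SelectFromModel keeps only the features with nonzero weights.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_selected = SelectFromModel(l1).fit_transform(X_filtered, y)
print(X.shape, "->", X_filtered.shape, "->", X_selected.shape)
```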
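And Part 3 in miniature: unsupervised PCA ignores the labels while supervised LDA uses them, so on the same data the two methods can pick very different low-dimensional projections.

```python
# Part 3 sketch: PCA vs. LDA on the iris data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)        # ignores the labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses them
print(X_pca.shape, X_lda.shape)
```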

References

References have been included in each Jupyter Notebook.

Author

@Yingxiang Chen
@Zihan Yang

Contact

If you find any mistakes, please feel free to reach out and correct us!

Yingxiang Chen E-mail: chenyingxiang3526@gmail.com
Zihan Yang E-mail: echoyang48@gmail.com