SECOM_class_imbalance

Approaches for the class imbalance problem in semiconductor manufacturing process line data

Data Description

The SECOM dataset in the UCI Machine Learning Repository contains semiconductor manufacturing process data with 1567 records, 590 anonymized features, and 104 fails. The process yield has a simple pass/fail response (encoded as -1/1).

The dataset has the following characteristics (the loading sketch after the list verifies them):

  1. two-class problem
  2. an imbalance with a 14:1 skew of passes to fails
  3. large number of features -- 590
  4. missing data
  5. features/columns which do not have sufficient information
  6. 4% of the columns/features have more than 50% of their records missing
  7. some columns have constant values

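A quick way to check these characteristics before modelling is to load the two raw files and inspect them with pandas. This is a minimal sketch, assuming the standard UCI download URLs and the whitespace-delimited file format; it is not part of the notebooks themselves.

```python
# Sketch: load the raw SECOM files and confirm the characteristics listed above.
# The URLs assume the standard UCI hosting of the dataset.
import pandas as pd

base = 'https://archive.ics.uci.edu/ml/machine-learning-databases/secom/'
X = pd.read_csv(base + 'secom.data', sep=' ', header=None)             # 1567 x 590 feature matrix
labels = pd.read_csv(base + 'secom_labels.data', sep=' ', header=None,
                     names=['label', 'timestamp'])                     # -1 = pass, 1 = fail

print(labels['label'].value_counts())            # roughly 14:1 pass/fail skew
print((X.isnull().mean() > 0.5).sum())           # columns with more than 50% missing values
print((X.nunique(dropna=True) <= 1).sum())       # constant (zero-variance) columns
```
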
Objective

The SECOM dataset presents us with two problems: (i) working with skewed data and (ii) feature selection. The main focus of this analysis is the class imbalance issue and the ability to successfully predict fails. Strategies used in fraud detection, anomaly detection, and rare disease diagnosis will be useful here. A secondary objective is feature reduction. (In some of the literature pertaining to the SECOM dataset, this was the primary goal [1].) A streamlined feature set can not only lead to better prediction accuracy and data understanding but also save manufacturing resources.

Software

  • Python 2.7
  • scikit-learn packages for algorithms
  • pandas for data wrangling
  • Matplotlib and Seaborn for plotting and visualization

Methods

We will look at some of the approaches that deal with class imbalance; these fall into cost-sensitive learning approaches and sampling-based approaches. We will also work with feature selection methods. The methods we use are listed below, with illustrative sketches of a few of them after the list:

  1. Random Forest variable importance (feature selection)
  2. One-class SVM
  3. SVM with SMOTE (oversampling minority class/undersampling majority class)
  4. SVM, Undersampling and Data Cleaning for Imbalanced Data
  5. Random Forest (weighting the classes)
  6. GBM

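Below are two minimal sketches of the methods above; variable names, the feature cutoff, and hyperparameters are illustrative assumptions rather than the notebooks' actual settings. The first sketch covers methods 1 and 5: it ranks features by Random Forest impurity-based importance, then refits a class-weighted Random Forest on the reduced feature set. It assumes X is the NaN-imputed feature matrix as a NumPy array and y holds the -1/1 labels.

```python
# Sketch of methods 1 and 5: RF variable importance for feature selection,
# followed by a class-weighted RF. X/y, the 40-feature cutoff and
# n_estimators are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X: imputed (1567, 590) feature array; y: -1 (pass) / 1 (fail)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Method 1: rank features by impurity-based importance, keep the strongest 40
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)
top = np.argsort(rf.feature_importances_)[::-1][:40]

# Method 5: refit on the reduced set, weighting classes so fails are not ignored
rf_weighted = RandomForestClassifier(n_estimators=500, class_weight='balanced',
                                     random_state=42)
rf_weighted.fit(X_train[:, top], y_train)
print(classification_report(y_test, rf_weighted.predict(X_test[:, top])))
```

The second sketch covers method 3 (SVM with SMOTE). SMOTE here comes from the imbalanced-learn package, which is not listed in the Software section above; older releases name the resampling call fit_sample rather than fit_resample. Only the training split is resampled, so the test set keeps its natural imbalance.

```python
# Sketch of method 3: SMOTE oversampling of the fail class, then an RBF SVM.
# Assumes the imbalanced-learn package and the X_train/X_test/top variables
# from the previous sketch.
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# SVMs are sensitive to feature scale, so standardize first
scaler = StandardScaler().fit(X_train[:, top])
X_tr = scaler.transform(X_train[:, top])
X_te = scaler.transform(X_test[:, top])

# Oversample the minority (fail) class in the training data only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_train)

svm = SVC(kernel='rbf', C=1.0).fit(X_res, y_res)
print(confusion_matrix(y_test, svm.predict(X_te)))
```
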
Further Reading

[1] M. McCann et al., "Causality Challenge: Benchmarking relevant signal components for effective monitoring and process control," NIPS Causality: Objectives and Assessment, 2010.
[2] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263-1284, 2009.