ADASYN

Adaptive Synthetic Sampling Approach for Imbalanced Learning

Primary language: Python. License: MIT.

ADASYN is a Python module that implements an adaptive oversampling technique for skewed datasets.

Many ML algorithms have trouble with heavily skewed datasets. If your dataset has 1000 examples, of which 950 belong to class 'Haystack' and the remaining 50 belong to class 'Needle', it becomes hard to correctly predict new, unseen data that belong to 'Needle'. This algorithm creates new artificial data for the minority class by adding semi-random noise to existing examples. For more information, read the full paper (see Reference).
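The "semi-random noise" can be pictured as follows. In the paper, a synthetic example is placed at a random position on the line segment between a minority example and one of its minority-class neighbors. Below is a minimal NumPy sketch of that single step. It is not this module's actual code; the function name `synthesize` and the sample points are made up for illustration.

```python
import numpy as np


def synthesize(x, neighbor, rng):
    """Return a random point on the segment between x and neighbor.

    Hypothetical helper for illustration only, not part of the adasyn module.
    """
    lam = rng.random()  # uniform in [0, 1)
    return x + lam * (neighbor - x)


rng = np.random.default_rng(0)
needle = np.array([1.0, 2.0])        # an existing minority example (made up)
other_needle = np.array([3.0, 6.0])  # one of its minority neighbors (made up)

new_point = synthesize(needle, other_needle, rng)
print(new_point)
```

Because the new point always lies between two real minority examples, it stays inside the region the minority class already occupies.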

Dependencies

  • pip (needed for install)
  • numpy
  • scipy
  • scikit-learn

Installation

To install ADASYN, run the following:

  pip install git+https://github.com/stavskal/ADASYN    

After installing the package, you can use it as follows:

from adasyn import ADASYN
adsn = ADASYN(k=7, imb_threshold=0.6, ratio=0.75)
new_X, new_y = adsn.fit_transform(X, y)  # your imbalanced dataset is in X, y

# In many applications you may want to keep artificial data separately
# adsn.index_new is a list that holds the indexes of these examples
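
One way to keep the artificial data apart is sketched below. It assumes that `adsn.index_new` holds row positions into `new_X` and `new_y`, which this README does not spell out, so check it against your installed version. The helper `split_synthetic` is hypothetical, and the arrays are stand-ins for the output of `fit_transform`, so the snippet runs without the package installed.

```python
import numpy as np


def split_synthetic(new_X, new_y, index_new):
    """Split oversampled data into (original, synthetic) parts.

    Hypothetical helper; assumes index_new lists row positions in new_X.
    """
    mask = np.zeros(len(new_X), dtype=bool)
    mask[np.asarray(index_new, dtype=int)] = True
    return (new_X[~mask], new_y[~mask]), (new_X[mask], new_y[mask])


# Stand-in data; in practice these would be new_X, new_y and adsn.index_new.
new_X = np.arange(10, dtype=float).reshape(5, 2)
new_y = np.array([0, 0, 0, 1, 1])
index_new = [3, 4]

(orig_X, orig_y), (synth_X, synth_y) = split_synthetic(new_X, new_y, index_new)
print(orig_X.shape, synth_X.shape)
```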

The original paper is listed under Reference below.

This module implements the idea presented in the paper by Haibo He et al. and also supports oversampling for multiclass classification problems. It is designed to be compatible with scikit-learn (https://github.com/scikit-learn/scikit-learn). It focuses on oversampling the examples that are harder to classify, and has shown results that sometimes outperform SMOTE or SMOTEBoost.
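
To illustrate what "harder to classify" means in the paper: each minority example is scored by the fraction of majority-class points among its k nearest neighbors, and the synthetic examples are shared out in proportion to those scores. The sketch below covers only the binary case with brute-force distances. It is an illustration under those assumptions, not this module's implementation; `adasyn_allocation`, `beta`, and the toy data are hypothetical names and values.

```python
import numpy as np


def adasyn_allocation(X, y, minority_label, k=5, beta=1.0):
    """Return how many synthetic points to create per minority example.

    Hypothetical sketch of the paper's weighting step (binary case only).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    m_s, m_l = int(is_min.sum()), int((~is_min).sum())
    total = (m_l - m_s) * beta  # total number of synthetic points to create

    X_min = X[is_min]
    # Distances from every minority point to every point in the dataset.
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    # A point must not count as its own neighbor.
    dists[np.arange(len(X_min)), np.flatnonzero(is_min)] = np.inf
    nearest = np.argsort(dists, axis=1)[:, :k]

    # r_i: fraction of majority points among the k nearest neighbors.
    r = (~is_min)[nearest].mean(axis=1)
    if r.sum() == 0:
        return np.zeros(len(X_min), dtype=int)
    return np.rint(r / r.sum() * total).astype(int)


# Toy data: one minority point sits inside the majority cluster (hard),
# three minority points sit far away in their own cluster (easy).
X = [[0.1, 0], [-0.1, 0], [0, 0.1], [0, -0.1], [0.2, 0.2], [-0.2, 0.2],
     [0, 0], [10, 10], [10.1, 10], [10, 10.1]]
y = ["Haystack"] * 6 + ["Needle"] * 4

alloc = adasyn_allocation(X, y, "Needle", k=2)
print(alloc)
```

In this toy setup the whole budget goes to the minority point surrounded by majority points, while the well-separated minority points receive none.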

Props to fmfn, who implemented several oversampling techniques; his good code structure and documentation highly influenced this module.

Reference:

  1. H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning,” in Proc. Int. Joint Conf. Neural Networks (IJCNN’08), pp. 1322-1328, 2008.