Awesome Imbalanced Learning

A curated list of awesome imbalanced learning papers, code, frameworks, and libraries.

Class-imbalance (also known as the long-tail problem) refers to classification problems in which the classes are not represented equally, which is quite common in practice. Examples include fraud detection, prediction of rare adverse drug reactions, and prediction of gene families. Failure to account for class imbalance often degrades the predictive performance of many classification algorithms. Imbalanced learning aims to tackle the class imbalance problem in order to learn an unbiased model from imbalanced data.

Inspired by awesome-machine-learning. Contributions are welcome!

Items marked with 🉑 are personally recommended (important/high-quality papers or libraries).

Libraries

Python

R

Java

  • KEEL [Github][Paper] - KEEL provides a simple GUI based on data flow to design experiments with different datasets and computational intelligence algorithms (paying special attention to evolutionary algorithms) in order to assess the behavior of the algorithms. This tool includes many widely used imbalanced learning techniques such as (evolutionary) over/under-resampling, cost-sensitive learning, algorithm modification, and ensemble learning methods.

    🉑 wide variety of classical classification, regression, preprocessing algorithms included.

Scala

Julia

  • smote_variants [Documentation][Github] - A collection of 85 minority over-sampling techniques for imbalanced learning, with multi-class oversampling and model selection features (also supports R and Julia).
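
    A minimal usage sketch in Python, assuming the package's documented oversampler.sample(X, y) interface (names and defaults may vary across versions):

    ```python
    # A hedged sketch of oversampling with smote_variants; assumes the
    # documented sample(X, y) interface of its oversampler classes.
    import numpy as np
    import smote_variants as sv

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # 100 majority samples
                   rng.normal(2, 1, size=(10, 2))])   # 10 minority samples
    y = np.array([0] * 100 + [1] * 10)

    oversampler = sv.SMOTE()                  # any of the 85 variants plugs in here
    X_samp, y_samp = oversampler.sample(X, y)
    print(np.bincount(y_samp))                # classes are now (roughly) balanced
    ```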

Papers

Surveys

  • Learning from imbalanced data (2009, 4700+ citations) - Highly cited, classic survey paper. It systematically reviews the popular solutions, evaluation metrics, and challenging problems for future research in this area (as of 2009).

    🉑 classic work.

  • Learning from imbalanced data: open challenges and future directions (2016, 400+ citations) - This paper concentrates on the open issues and challenges in imbalanced learning, such as extreme class imbalance, dealing with imbalance in online/stream learning, multi-class imbalanced learning, and semi-supervised or unsupervised imbalanced learning.
  • Learning from class-imbalanced data: Review of methods and applications (2017, 400+ citations) - An exhaustive survey of imbalanced learning methods and applications; a total of 527 papers were included in this study. It provides several detailed taxonomies of existing methods and also covers the recent trends in this research area.

    🉑 a systematic survey with detailed taxonomies of existing methods.

Deep Learning

Data Resampling

  • Over-sampling

    • ROS [Code] - Random Over-sampling
    • SMOTE [Code] (2002, 9800+ citations) - Synthetic Minority Over-sampling TEchnique (see the usage sketch after this list)

      🉑 classic work.

    • Borderline-SMOTE [Code] (2005, 1400+ citations) - Borderline-Synthetic Minority Over-sampling TEchnique
    • ADASYN [Code] (2008, 1100+ citations) - ADAptive SYNthetic Sampling
    • SPIDER [Code (Java)] (2008, 150+ citations) - Selective Preprocessing of Imbalanced Data
    • Safe-Level-SMOTE [Code (Java)] (2009, 370+ citations) - Safe Level Synthetic Minority Over-sampling TEchnique
    • SVM-SMOTE [Code] (2009, 120+ citations) - SMOTE based on Support Vectors of SVM
    • SMOTE-IPF (2015, 180+ citations) - SMOTE with Iterative-Partitioning Filter
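
      Several of the over-samplers above (ROS, SMOTE, Borderline-SMOTE, ADASYN, SVM-SMOTE) have implementations in the imbalanced-learn Python package; a minimal sketch of its fit_resample API (the toy data is illustrative):

      ```python
      from collections import Counter
      from sklearn.datasets import make_classification
      from imblearn.over_sampling import (RandomOverSampler, SMOTE,
                                          BorderlineSMOTE, ADASYN, SVMSMOTE)

      # Toy imbalanced binary problem, roughly 9:1.
      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)
      print(Counter(y))

      for sampler in (RandomOverSampler(random_state=42), SMOTE(random_state=42),
                      BorderlineSMOTE(random_state=42), ADASYN(random_state=42),
                      SVMSMOTE(random_state=42)):
          X_res, y_res = sampler.fit_resample(X, y)
          print(type(sampler).__name__, Counter(y_res))
      ```
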
  • Under-sampling

    • RUS [Code] - Random Under-sampling
    • CNN [Code] (1968, 2100+ citations) - Condensed Nearest Neighbor
    • ENN [Code] (1972, 1500+ citations) - Edited Nearest Neighbor
    • TomekLink [Code] (1976, 870+ citations) - Tomek's modification of Condensed Nearest Neighbor
    • NCR [Code] (2001, 500+ citations) - Neighborhood Cleaning Rule
    • NearMiss-1 & 2 & 3 [Code] (2003, 420+ citations) - Several kNN approaches to unbalanced data distributions.
    • CNN with TomekLink [Code (Java)] (2004, 2000+ citations) - Condensed Nearest Neighbor + TomekLink
    • OSS [Code] (1997, 2100+ citations) - One-Sided Selection
    • EUS (2009, 290+ citations) - Evolutionary Under-sampling
    • IHT [Code] (2014, 130+ citations) - Instance Hardness Threshold
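
      Most of the under-samplers above are also implemented in imbalanced-learn (RandomUnderSampler, CondensedNearestNeighbour, EditedNearestNeighbours, TomekLinks, NeighbourhoodCleaningRule, NearMiss, OneSidedSelection, InstanceHardnessThreshold); a minimal sketch with two of them:

      ```python
      from collections import Counter
      from sklearn.datasets import make_classification
      from imblearn.under_sampling import RandomUnderSampler, TomekLinks

      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

      # RUS: randomly drop majority samples until the classes are balanced.
      X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
      print(Counter(y_rus))

      # Cleaning methods such as Tomek links remove only borderline majority
      # samples, so the result is cleaner but not exactly balanced.
      X_tl, y_tl = TomekLinks().fit_resample(X, y)
      print(Counter(y_tl))
      ```
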
  • Hybrid-sampling

    • SMOTE-Tomek & SMOTE-ENN (2004, 2000+ citations) [Code (SMOTE-Tomek)] [Code (SMOTE-ENN)] - Synthetic Minority Over-sampling TEchnique followed by data cleaning with Tomek links / Edited Nearest Neighbor (see the sketch after this list)

      🉑 extensive experimental evaluation involving 10 different over/under-sampling methods.

    • SMOTE-RSB (2012, 210+ citations) - Hybrid Preprocessing using SMOTE and Rough Sets Theory
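
      Both hybrids from the first entry above are implemented in imbalanced-learn; a minimal sketch:

      ```python
      from collections import Counter
      from sklearn.datasets import make_classification
      from imblearn.combine import SMOTEENN, SMOTETomek

      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

      # Over-sample with SMOTE, then clean the result with Tomek links
      # (SMOTETomek) or Edited Nearest Neighbours (SMOTEENN).
      X_st, y_st = SMOTETomek(random_state=42).fit_resample(X, y)
      X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
      print(Counter(y_st), Counter(y_se))
      ```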

Cost-sensitive Learning

  • CSC4.5 [Code (Java)] (2002, 420+ citations) - An instance-weighting method to induce cost-sensitive trees
  • CSSVM [Code (Java)] (2008, 710+ citations) - Cost-sensitive SVMs for highly imbalanced classification
  • CSNN [Code (Java)] (2005, 950+ citations) - Training cost-sensitive neural networks with methods addressing the class imbalance problem.
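
The three methods above modify specific learners (C4.5, SVM, neural networks) internally. As a simpler, commonly used proxy (not the exact algorithms cited above), many scikit-learn estimators accept per-class misclassification costs through the class_weight parameter:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

# Explicit costs: errors on the minority class (1) are penalized
# ten times more heavily than errors on the majority class (0).
svm = SVC(class_weight={0: 1, 1: 10}).fit(X, y)

# "balanced" derives weights inversely proportional to class frequencies.
tree = DecisionTreeClassifier(class_weight="balanced").fit(X, y)
```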

Ensemble Learning

  • Boosting-based

    • AdaBoost [Code] (1995, 18700+ citations) - Adaptive Boosting with C4.5
    • DataBoost (2004, 570+ citations) - Boosting with Data Generation for Imbalanced Data
    • SMOTEBoost [Code] (2003, 1100+ citations) - Synthetic Minority Over-sampling TEchnique Boosting

      🉑 classic work.

    • MSMOTEBoost (2011, 1300+ citations) - Modified Synthetic Minority Over-sampling TEchnique Boosting
    • RAMOBoost [Code] (2010, 140+ citations) - Ranked Minority Over-sampling in Boosting
    • RUSBoost [Code] (2009, 850+ citations) - Random Under-Sampling Boosting (see the sketch after this list)

      🉑 classic work.

    • AdaBoostNC (2012, 350+ citations) - Adaptive Boosting with Negative Correlation Learning
    • EUSBoost (2013, 210+ citations) - Evolutionary Under-sampling in Boosting
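
      A minimal sketch of RUSBoost (referenced above) using imbalanced-learn's RUSBoostClassifier:

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from imblearn.ensemble import RUSBoostClassifier

      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

      # RUSBoost: each boosting round trains on a randomly under-sampled
      # (balanced) view of the data, keeping AdaBoost's reweighting scheme.
      clf = RUSBoostClassifier(n_estimators=50, random_state=42)
      print(cross_val_score(clf, X, y, scoring="roc_auc").mean())
      ```
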
  • Bagging-based
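
      As an illustration of this family, imbalanced-learn provides BalancedBaggingClassifier, which re-balances each bootstrap sample (by random under-sampling, by default) before fitting the base learner; a minimal sketch:

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from imblearn.ensemble import BalancedBaggingClassifier

      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

      # Each bootstrap sample is re-balanced before a base estimator
      # (a decision tree by default) is fit on it.
      clf = BalancedBaggingClassifier(n_estimators=50, random_state=42)
      print(cross_val_score(clf, X, y, scoring="roc_auc").mean())
      ```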

  • Other forms of ensemble

    • EasyEnsemble & BalanceCascade [Code (EasyEnsemble)] [Code (BalanceCascade)] (2008, 1300+ citations) - Parallel ensemble training with RUS (EasyEnsemble) / cascade ensemble training with RUS that iteratively drops well-classified examples (BalanceCascade); see the sketch after this list

      🉑 simple but effective solution.

    • Self-paced Ensemble [Code] (ICDE 2020) - Training Effective Ensemble on Imbalanced Data by Self-paced Harmonizing Classification Hardness

      🉑 high performance & computational efficiency & widely applicable to different classifiers.
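
      EasyEnsemble (referenced above) is available as imbalanced-learn's EasyEnsembleClassifier; a minimal sketch:

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from imblearn.ensemble import EasyEnsembleClassifier

      X, y = make_classification(n_samples=1100, weights=[0.9], random_state=42)

      # EasyEnsemble: a bag of AdaBoost learners, each trained on an
      # independently under-sampled (balanced) subset of the data.
      clf = EasyEnsembleClassifier(n_estimators=10, random_state=42)
      print(cross_val_score(clf, X, y, scoring="roc_auc").mean())
      ```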

Anomaly Detection

Others

Imbalanced Datasets

ID Name Repository & Target Ratio #S #F
1 ecoli UCI, target: imU 8.6:1 336 7
2 optical_digits UCI, target: 8 9.1:1 5,620 64
3 satimage UCI, target: 4 9.3:1 6,435 36
4 pen_digits UCI, target: 5 9.4:1 10,992 16
5 abalone UCI, target: 7 9.7:1 4,177 10
6 sick_euthyroid UCI, target: sick euthyroid 9.8:1 3,163 42
7 spectrometer UCI, target: >=44 11:1 531 93
8 car_eval_34 UCI, target: good, vgood 12:1 1,728 21
9 isolet UCI, target: A, B 12:1 7,797 617
10 us_crime UCI, target: >0.65 12:1 1,994 100
11 yeast_ml8 LIBSVM, target: 8 13:1 2,417 103
12 scene LIBSVM, target: >one label 13:1 2,407 294
13 libras_move UCI, target: 1 14:1 360 90
14 thyroid_sick UCI, target: sick 15:1 3,772 52
15 coil_2000 KDD, CoIL, target: minority 16:1 9,822 85
16 arrhythmia UCI, target: 06 17:1 452 278
17 solar_flare_m0 UCI, target: M->0 19:1 1,389 32
18 oil UCI, target: minority 22:1 937 49
19 car_eval_4 UCI, target: vgood 26:1 1,728 21
20 wine_quality UCI, wine, target: <=4 26:1 4,898 11
21 letter_img UCI, target: Z 26:1 20,000 16
22 yeast_me2 UCI, target: ME2 28:1 1,484 8
23 webpage LIBSVM, w7a, target: minority 33:1 34,780 300
24 ozone_level UCI, ozone, data 34:1 2,536 72
25 mammography UCI, target: minority 42:1 11,183 6
26 protein_homo KDD CUP 2004, minority 111:1 145,751 74
27 abalone_19 UCI, target: 19 130:1 4,177 10

Note: This collection of datasets is from imblearn.datasets.fetch_datasets. Ratio is the imbalance ratio (majority:minority); #S and #F denote the number of samples and features, respectively.
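
A minimal sketch of loading one of these datasets by the name in the table (fetch_datasets downloads and caches the data; filter_data selects datasets by name):

```python
from collections import Counter
from imblearn.datasets import fetch_datasets

# Download (and cache) a single dataset by its name in the table above.
datasets = fetch_datasets(filter_data=("ecoli",))
ecoli = datasets["ecoli"]

# Each entry is a Bunch with `data` and `target`; the minority class is
# labeled 1 and the majority class -1.
print(ecoli.data.shape)        # (336, 7), matching row 1 above
print(Counter(ecoli.target))
```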

Other Resources