This is the code repository for Machine Learning for Imbalanced Data, published by Packt.
Tackle imbalanced datasets using machine learning and deep learning techniques
As machine learning practitioners, we often encounter imbalanced datasets, in which one class has considerably fewer instances than the others. Many machine learning algorithms implicitly assume a roughly balanced class distribution, which leads to suboptimal performance on imbalanced data. This comprehensive guide helps you address class imbalance to significantly improve model performance.
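As a quick illustration of why imbalance matters (this sketch is not from the book; it assumes scikit-learn is installed), a classifier that ignores the minority class entirely can still report near-perfect accuracy on a 99:1 dataset:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# Synthetic dataset with a 99:1 class ratio.
X, y = make_classification(
    n_samples=10_000, weights=[0.99, 0.01], flip_y=0, random_state=42
)
print(Counter(y))

# A "model" that always predicts the majority class still scores ~99% accuracy,
# while being useless for the minority class we usually care about.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = clf.score(X, y)
print(f"Accuracy: {acc:.2%}")
```

This is the accuracy paradox that motivates the imbalance-aware metrics and techniques covered in the book.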
This book covers the following exciting features:
- Use imbalanced data in your machine learning models effectively
- Explore the metrics used when classes are imbalanced
- Understand how and when to apply various sampling methods such as over-sampling and under-sampling
- Apply data-based, algorithm-based, and hybrid approaches to deal with class imbalance
- Combine and choose from various options for data balancing while avoiding common pitfalls
- Understand the concepts of model calibration and threshold adjustment in the context of dealing with imbalanced datasets
If you feel this book is for you, get your copy today!
If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Please go to this link to claim your free PDF.
All of the code is organized into folders.
The code will look like the following:
import matplotlib.pyplot as plt
import seaborn as sns

sep = 2  # class separation passed to the book's make_data helper
X, y = make_data(sep=sep)
print(y.value_counts())
sns.scatterplot(data=X, x="feature_1", y="feature_2")
plt.title('Separation: {}'.format(sep))
plt.show()
Following is what you need for this book: This book is for machine learning practitioners who want to effectively address the challenges of imbalanced datasets in their projects. Data scientists, data engineers, machine learning engineers and scientists, and research scientists and engineers will find this book helpful. Though complete beginners are welcome to read this book, some familiarity with core machine learning concepts will help readers maximize the benefits and insights gained from this comprehensive resource.
With the following software and hardware list you can run all code files present in the book (Chapters 1 to 10).
| Chapter | Software required | OS required |
| --- | --- | --- |
| 1-10 | Google Colab | Any OS |
If you have any questions or feedback, please feel free to use the Discussions tab of this repository. You can start a new discussion under an appropriate category.
Kumar Abhishek is a seasoned Senior Machine Learning Engineer, specializing in risk analysis and fraud detection. With over a decade of experience at companies such as Expedia, Microsoft, Amazon, and a Bay Area startup, Kumar holds an MS in Computer Science from the University of Florida.
Dr. Mounir Abdelaziz is a deep learning researcher specializing in computer vision applications. He holds a Ph.D. in computer science and technology from Central South University, China. During his Ph.D. journey, he developed innovative algorithms to address practical computer vision challenges. He has also authored numerous research articles in the field of few-shot learning for image classification.
- Introduction to Data Imbalance in Machine Learning [open dir]
- Oversampling Methods [open dir]
- Undersampling Methods [open dir]
- Ensemble Methods [open dir]
- Cost-Sensitive Learning [open dir]
- Data Imbalance in Deep Learning [open dir]
- Data-Level Deep Learning Methods [open dir]
- Algorithm-Level Deep Learning Techniques [open dir]
- Hybrid Deep Learning Methods [open dir]
- Model Calibration [open dir]
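Cost-sensitive learning, covered in its own chapter above, can often be enabled with a single parameter in scikit-learn. As a hedged sketch (not from the book), `class_weight="balanced"` reweights the loss inversely to class frequency, which typically raises minority-class recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5_000, weights=[0.95, 0.05], flip_y=0, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# Unweighted model vs. one whose per-class loss is scaled by 1 / class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

plain_recall = recall_score(y_te, plain.predict(X_te))
weighted_recall = recall_score(y_te, weighted.predict(X_te))
print("plain recall:   ", plain_recall)
print("weighted recall:", weighted_recall)
```

XGBoost exposes the same idea through its `scale_pos_weight` parameter; both are discussed in the Chapter 5 notebooks.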
| Notebook ID | Description | Link |
| --- | --- | --- |
Notebook 1.1 | Imbalanced-learn demo | ipynb/colab |
Notebook 2.1 | Oversampling techniques | ipynb/colab |
Notebook 2.2 | Oversampling performance | ipynb/colab |
Notebook 2.3 | SMOTE problems | ipynb/colab |
Notebook 3.1 | Various undersampling techniques | ipynb/colab |
Notebook 3.2 | Undersampling performance | ipynb/colab |
Notebook 4.1 | Ensemble techniques overview | ipynb/colab |
Notebook 4.2 | Ensembling methods performance | ipynb/colab |
Notebook 5.1 | Class weight with Sklearn/XGBoost | ipynb/colab |
Notebook 5.2 | Threshold tuning techniques | ipynb/colab |
Notebook 6.1 | Simple neural network | ipynb/colab |
Notebook 6.2 | Multi-class classification | ipynb/colab |
Notebook 7.1 | Augmix on FashionMNIST | ipynb/colab |
Notebook 7.2 | Cutmix, Mixup, Remix on FashionMNIST | ipynb/colab |
Notebook 7.3 | NLP data-level techniques | ipynb/colab |
Notebook 7.4 | Dynamic sampling | ipynb/colab |
Notebook 7.5 | VAE with MNIST | ipynb/colab |
Notebook 7.6 | Cutmix technique | ipynb/colab |
Notebook 7.7 | Cutout technique | ipynb/colab |
Notebook 7.8 | Mixup technique | ipynb/colab |
Notebook 7.9 | Data transformation plotting | ipynb/colab |
Notebook 8.1 | CIFAR10 focal loss | ipynb/colab |
Notebook 8.2 | CDT loss implementation | ipynb/colab |
Notebook 8.3 | Class balanced loss | ipynb/colab |
Notebook 8.4 | Class-wise difficulty balanced loss | ipynb/colab |
Notebook 8.5 | DRW technique | ipynb/colab |
Notebook 8.6 | Tweet emotion detection | ipynb/colab |
Notebook 8.7 | PyTorch class weighting | ipynb/colab |
Notebook 9.1 | GNN demo | ipynb/colab |
Notebook 9.2 | OHEM technique | ipynb/colab |
Notebook 9.3 | Class rectification loss | ipynb/colab |
Notebook 10.1 | Calibration techniques | ipynb/colab |
Notebook 10.2 | Sampling/weighting impact on calibration | ipynb/colab |
Notebook 10.3 | Imbalance handling impact on calibration | ipynb/colab |
Notebook 10.4 | Kaggle HR data calibration | ipynb/colab |
Notebook 10.5 | Platt scaling and isotonic regression | ipynb/colab |
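The calibration notebooks above (10.1-10.5) revolve around scikit-learn's `CalibratedClassifierCV`. As a minimal sketch (not from the book; it assumes scikit-learn is installed), `method="sigmoid"` applies Platt scaling, while `method="isotonic"` fits a non-parametric monotonic mapping instead:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive Bayes is a classic example of a poorly calibrated model; wrap it so its
# scores are mapped to probabilities via cross-validated Platt scaling.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)[:, 1]
brier = brier_score_loss(y_te, proba)
print("Brier score:", brier)
```

Lower Brier scores indicate better-calibrated probabilities, which matters whenever a threshold will later be tuned on the predicted scores.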
Kumar Abhishek, Dr. Mounir Abdelaziz, Machine Learning for Imbalanced Data. Packt Publishing, 2023.
@book{mlimbdata2023,
title = {Machine Learning for Imbalanced Data},
author = {Kumar Abhishek and Mounir Abdelaziz},
year = {2023},
publisher = {Packt},
isbn = {9781801070836}
}