Awesome Imbalanced Learning in Production

If you find this content insightful, feel free to support with a ⭐

A search for "Imbalanced data" on Google Scholar yields over one million results in total, with approximately 17,000 results since 2023. This raises a question: Is it merely a popular topic for academic papers, or is it a genuine challenge faced in the industry?

This GitHub repository offers a curated list of resources showing how various companies have tackled imbalanced data in production settings. It is intended as a resource for machine learning and data science practitioners, researchers, and enthusiasts dealing with similar challenges.

Inspired by Awesome-System-Design

Contents

Case Studies

Articles

Videos

Other general resources on the topic

Oversampling Methods

| Technique | Company |
| --- | --- |
| Random oversampling | Grab |
| SMOTE | Microsoft |
| Borderline-SMOTE | Amazon, Paper |
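
For readers new to the first technique listed above, here is a minimal sketch of random oversampling using only the Python standard library. The function name `random_oversample` and the toy data are hypothetical, and this is not taken from any of the linked case studies.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows of this class in the original data (never resample the copies).
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data: four majority rows (label 0) and one minority row (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # each class now has 4 rows
```

Resampling is normally applied to the training split only, so that validation and test data keep the original class distribution.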

Undersampling Methods

| Technique | Company |
| --- | --- |
| Negative downsampling | Meta |
| Negative downsampling | Microsoft |
| Negative downsampling | Uber |
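
As a rough illustration of negative downsampling, the sketch below keeps every positive example, keeps each negative with probability `keep_rate`, and then maps a probability predicted on the downsampled data back to the original scale with the commonly used correction q = p / (p + (1 - p) / w), where w is the keep rate. The function names and toy data are hypothetical, and the linked case studies may differ in detail.

```python
import random


def downsample_negatives(rows, labels, keep_rate, seed=0):
    """Keep every positive; keep each negative with probability keep_rate."""
    rng = random.Random(seed)
    kept = [
        (r, lab)
        for r, lab in zip(rows, labels)
        if lab == 1 or rng.random() < keep_rate
    ]
    return [r for r, _ in kept], [lab for _, lab in kept]


def recalibrate(p, keep_rate):
    """Convert a probability learned on downsampled data to the original scale."""
    return p / (p + (1.0 - p) / keep_rate)


# Toy data: 1,000 rows with 10 positives (1% positive rate).
rows = list(range(1000))
labels = [1 if i % 100 == 0 else 0 for i in rows]
rows_ds, labels_ds = downsample_negatives(rows, labels, keep_rate=0.1)

# A score of 0.5 on the downsampled data corresponds to about 0.09 originally.
print(recalibrate(0.5, keep_rate=0.1))
```

The correction follows from the odds: dropping negatives at rate w multiplies the positive-to-negative odds by 1 / w, so the original odds are w times the predicted odds.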

Ensemble Methods

| Technique | Company |
| --- | --- |
| BalancedBaggingClassifier | Microsoft |

Cost-Sensitive Learning

| Technique | Company |
| --- | --- |
| Cost-sensitive learning | Microsoft |
| Cost-sensitive learning | Airbnb |
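
One simple form of cost-sensitive learning leaves the model untouched and moves the decision threshold according to the misclassification costs. The sketch below assumes calibrated probabilities and zero cost for correct decisions; the function names and cost values are hypothetical and not drawn from the linked posts.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has the lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive wins when
    p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(p_positive, cost_fp, cost_fn):
    """Return True when the positive prediction is the cheaper choice."""
    return p_positive >= cost_sensitive_threshold(cost_fp, cost_fn)


# Assumed costs: a missed positive is 9 times as costly as a false alarm.
print(cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0))
print(decide(0.2, cost_fp=1.0, cost_fn=9.0))
```

With equal costs the threshold falls back to the usual 0.5; the more expensive a missed positive is, the lower the threshold goes.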

Data-Level Deep Learning Methods

| Application | Company | Year |
| --- | --- | --- |
| Real-Time Personalization | Etsy | 2023 |
| Automated Image Tagging | Booking.com | 2017 |
| Shot Angle Prediction | Wayfair | 2020 |
| Data Protection | Grab | 2021 |
| ML Model Improvement | Cloudflare | 2022 |

Algorithm-Level Deep Learning Techniques

| Technique | Company |
| --- | --- |
| DALL·E 2 pre-training mitigations | OpenAI |
| BERT for NLP | Wayfair |
| Focal loss | Meta |
| Class-balanced loss | Apple |
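
The last two entries can be sketched in a few lines. The code below follows the usual published formulations, binary focal loss as -alpha_t * (1 - p_t)^gamma * log(p_t) and class-balanced weights as (1 - beta) / (1 - beta^n), rather than any company's implementation. The function names, default hyperparameters, and class counts are assumptions for illustration.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    p_t = min(max(p_t, eps), 1.0)               # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def class_balanced_weights(samples_per_class, beta=0.999):
    """Weight each class by the inverse of its effective number of samples."""
    return [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]


# A well-classified positive contributes far less loss than a misclassified one.
print(focal_loss(0.9, 1), focal_loss(0.1, 1))

# Rare classes receive larger weights (the counts here are made up).
print(class_balanced_weights([1000, 100, 10]))
```

Setting `gamma=0` reduces the focal loss to alpha-weighted cross-entropy, and setting `beta=0` gives every class a weight of 1. Normalizing the class-balanced weights, which some formulations do, is left out here.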

Hybrid Deep Learning Methods

| Method | Company | Year |
| --- | --- | --- |
| Relational Graph Learning | Uber | 2021 |
| Graph for Fraud Detection | Grab | 2022 |

Model Calibration

| Technique | Company | Year |
| --- | --- | --- |
| Model Calibration | Netflix | 2018 |
| Model Calibration | Meta | 2014 |
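
To make "calibration" concrete, the sketch below computes a binned calibration error for a binary classifier: predictions are grouped into equal-width probability bins, and the gap between the mean predicted probability and the observed positive rate is averaged across bins, weighted by bin size. This is one common diagnostic, not necessarily the one used in the linked posts; the function name and bin count are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Size-weighted average gap between predicted and observed positive rates."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_pred - pos_rate)
    return ece


# Ten predictions of 0.1 with one actual positive: predicted and observed rates agree.
print(expected_calibration_error([0.1] * 10, [1] + [0] * 9))
```

A value near zero means predicted probabilities match observed frequencies. Resampling the training data typically shifts predicted probabilities away from the true base rate, which is why calibration appears alongside the sampling methods above.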

Case Studies to check out

Articles

Videos

| Video Title | Company | Year |
| --- | --- | --- |
| Natalie Hockham: Machine learning with imbalanced data sets | GoCardless | 2015 |
| PyData Tel Aviv Meetup: GANs for Imbalanced Classification Problems - Moshe Salhov | Playtika | 2018 |
| Brendan Herger - Machine Learning Techniques for Class Imbalances & Adversaries | CapitalOne | 2016 |

Other general resources on the topic

Libraries, Tools and Frameworks

Datasets

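| Task | Dataset | # Classes | # Training Data | # Test Data |
| --- | --- | --- | --- | --- |
| Image Classification | ImageNet-LT [15] | 1,000 | 115,846 | 50,000 |
| Image Classification | CIFAR100-LT [18] | 100 | 50,000 | 10,000 |
| Image Classification | Places-LT [15] | 365 | 62,500 | 36,500 |
| Image Classification | iNaturalist 2018 [23] | 8,142 | 437,513 | 24,426 |
| Object Detection/Instance Segmentation | LVIS v0.5 [36] | 1,230 | 57,000 | 20,000 |
| Object Detection/Instance Segmentation | LVIS v1 [36] | 1,203 | 100,000 | 19,800 |
| Multi-label Classification | VOC-LT [37] | 20 | 1,142 | 4,952 |
| Multi-label Classification | COCO-LT [37] | 80 | 1,909 | 5,000 |
| Video Classification | VideoLT [38] | 1,004 | 179,352 | 51,244 |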

Benchmarks

awesome-libraries on GitHub

Books


Please note that this is a living document and will be updated with more resources over time. Contributions are welcome!