The choice of optimizer is a key factor in the efficient training of deep neural networks. Adaptive-learning-rate optimizers have significantly reduced training time compared with the widely used fixed-learning-rate optimizers. However, for adaptive-learning-rate methods, an undesirably large variance in the early stages of training, caused by the limited number of samples seen so far, may drive the model away from optimal solutions. An imbalanced data set, on the other hand, presents a severely skewed class distribution that can cause large variations of the gradient during learning. We therefore study the impact of imbalanced data sets on different optimizers, as we suspect undesired behaviour on such data for well-known optimizers such as SGD, RMSprop and Adam.
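As a minimal sketch of how such a comparison can be set up in Keras (the architecture and hyperparameters here are illustrative assumptions; the actual network is built in `helpers.py`):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, RMSprop, Adam

def build_model(n_features):
    # Illustrative architecture; the real one is defined in helpers.py.
    return Sequential([
        Dense(32, activation='relu', input_shape=(n_features,)),
        Dense(1, activation='sigmoid'),
    ])

# One fresh copy of the model per optimizer, so the training behaviour of
# SGD, RMSprop and Adam can be compared on the same imbalanced problem.
models = {}
for name, optimizer in [('sgd', SGD()), ('rmsprop', RMSprop()), ('adam', Adam())]:
    model = build_model(n_features=20)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    models[name] = model
```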
The project was designed for Python 3.6.6 or higher. To run the model's computations, simply execute the file `run.py`; from the terminal, the command is `python run.py`.
The following libraries are used:
- numpy 1.14.3, available through Anaconda
- pandas, also available through Anaconda
- scikit-learn:
pip install -U scikit-learn
- keras:
pip install Keras
- tensorflow:
pip install tensorflow
- matplotlib:
python -m pip install -U matplotlib
- seaborn:
pip install seaborn
To run the RAdam optimizer in the notebook `project.ipynb` (a usage sketch follows the install command):
- RAdam:
pip install keras-rectified-adam
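A minimal usage sketch, assuming the `keras-rectified-adam` package exposes the `RAdam` optimizer under the `keras_radam` module (the model below is illustrative):

```python
from keras.models import Sequential
from keras.layers import Dense
from keras_radam import RAdam  # provided by the keras-rectified-adam package

# Illustrative model; the actual network is built in helpers.py.
model = Sequential([
    Dense(32, activation='relu', input_shape=(20,)),
    Dense(1, activation='sigmoid'),
])

# RAdam is passed to compile() like any built-in Keras optimizer.
model.compile(optimizer=RAdam(), loss='binary_crossentropy', metrics=['accuracy'])
```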
The code `run.py` relies on the following files:
- `helpers.py`: handles the creation of the spectrum, the building of the neural network and the plots.
- `benchmarking.py`: functions used for the benchmarking, namely the computation of the loss against the number of epochs, and of the recall, accuracy, precision and F1-score against the spectrum (a sketch of the metric computation follows this list).
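As an illustration of the metrics reported by `benchmarking.py`, here is a minimal sketch based on scikit-learn; the actual function names and signatures in `benchmarking.py` may differ:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_scores(y_true, y_prob, threshold=0.5):
    """Compute the four benchmark metrics from predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }
```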
The `datasets` folder is also needed; it stores the full data set. The experiment uses the data set `bank-additional-full.csv`, located in this folder.
The folder `literature` contains the scientific papers that inspired our project. The folder `figures` gathers all the figures plotted for the report. The notebook `DataAnalysis.ipynb` contains the analysis of the raw data distribution.
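A short sketch of the kind of distribution check performed there, assuming the target column is named `y` as in the UCI data set:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('datasets/bank-additional-full.csv', sep=';')

# The 'no' class heavily outnumbers 'yes': this is the class imbalance
# whose effect on the optimizers is studied in the project.
print(data['y'].value_counts(normalize=True))
sns.countplot(x='y', data=data)
plt.title('Class distribution of the target variable')
plt.show()
```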
The notebook `project.ipynb` contains the simulations of `run.py` and the RAdam implementation, with a plot of the loss against epochs, which turns out to be unsuccessful on this type of problem. This notebook requires the code `helpers.py` and the data set `bank-additional-full.csv`.
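A minimal sketch of how such a loss-against-epochs curve can be produced from the Keras training history (the stand-in data and model below are illustrative; in the project they come from `bank-additional-full.csv` and `helpers.py`):

```python
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Stand-in data with roughly 10% positives to mimic the class imbalance.
rng = np.random.RandomState(0)
X = rng.rand(1000, 20).astype('float32')
y = (rng.rand(1000) < 0.1).astype(int)

model = Sequential([
    Dense(32, activation='relu', input_shape=(20,)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer=Adam(), loss='binary_crossentropy')

# fit() returns a History object whose `history` dict holds the loss per epoch.
history = model.fit(X, y, epochs=20, batch_size=128, validation_split=0.2, verbose=0)

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy loss')
plt.legend()
plt.show()
```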
The notebooks were used for the analysis but are not part of the main implementation. The following references were consulted:
- Class Project: description of the project.
- An overview of gradient descent optimization algorithms: description of the optimizers used in this project.
- Neural Networks for Machine Learning: RMSprop.
- Adam: A Method for Stochastic Optimization: Adam.
- Stochastic gradient descent: SGD.
- Incorporating Nesterov Momentum into Adam: Nadam.
- On the Variance of the Adaptive Learning Rate and Beyond: Rectified Adam (RAdam).
- Keras: the Keras library.
- Bank Marketing Data Set: the data set used for the experiment, with its features.
- Members: Cadillon Alexandre, Hoggett Emma, Moussa Abdeljalil
The project was submitted on 12 June 2020 as part of the Optimization for Machine Learning course.