Neural Network implementation in NumPy & Keras.
The code consists of the following (the idea is that in the future these pieces can be combined; as of now they are used independently):
-
deep_neural_network_base: Class that builds an MLP based on the activation function and number of nodes specified per layer.
-
deep_neural_network_batch_normalization: Extension of the deep_neural_network_base class that implements batch normalization (needs work: the mean and variance used at inference are not properly coded yet).
-
deep_neural_network_dropout: Extension of the deep_neural_network_base class that implements dropout for training and inference.
-
deep_neural_network_l2regularization: Extension of the deep_neural_network_base class that implements L2 regularization.
-
deep_neural_network_optimized: Extension of the deep_neural_network_base class that provides the following options for the gradient update:
- Momentum.
- NAG (Nesterov accelerated gradient): pending.
- Adagrad.
- RMSprop.
- Adadelta.
- Adam.
- Adamax.
- Nadam: pending.
-
launch_nn_training: Code to test out the implementation on MNIST data.
-
deep_neural_network_keras: Keras implementation of an MLP for MNIST, used to run a more comprehensive analysis of the concepts explained below.
This is a brief summary of my own understanding of: Regularization, Optimization, Batch Normalization and Gradient updates.
- Mathematical demonstration: Backpropagation
- Mathematical demonstration: Cross-entropy & Softmax gradients
- Mathematical demonstration: Batch Normalization Backpropagation
-
Batch gradient descent:
- It doesn't allow on-line training.
- Slow updates, since every update requires a full pass over the training set.
-
Stochastic gradient descent:
- Allows on-line training.
- Updates on a single sample, which causes high variance in the cost function across weight updates.
- Makes it harder to settle at the minimum, as it constantly overshoots.
-
Mini-batch gradient descent:
- With a proper batch size, it reduces the variance of the updates.
- Converges to a minimum faster than batch gradient descent, since it updates sooner.
- A good batch (matrix) size can take advantage of different computer architectures for matrix operations. A minimal update-loop sketch follows this list.
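A minimal NumPy sketch of the mini-batch update loop described above. The parameter names (`W`, `b`) and the `compute_gradients` helper are hypothetical placeholders for illustration, not the repo's actual API; setting `batch_size=1` recovers SGD and `batch_size=n` recovers batch gradient descent.

```python
import numpy as np

def minibatch_sgd(X, y, W, b, compute_gradients, lr=0.01, batch_size=128, epochs=10):
    """Plain mini-batch gradient descent over the dataset (X, y)."""
    n = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(n)           # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # indices of one mini-batch
            dW, db = compute_gradients(X[idx], y[idx], W, b)
            W -= lr * dW                          # gradient step using only this mini-batch
            b -= lr * db
    return W, b
```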
-
Batch Normalization:
- Limits how much the data distribution inside a layer can shift as the weights and biases are updated.
- It weakens the coupling between one layer and the previous one.
- It has the same effect as normalizing the inputs: it keeps the data distribution from drifting to very high or very low values as it flows through the network, which allows a faster convergence to the minimum.
- It has a regularization effect: during training the normalization statistics are computed on the mini-batch, which introduces some noise into the layer outputs Z[l]. Similar to Dropout in effect. A forward-pass sketch, including the running statistics used for inference, follows this list.
- Mathematical demonstration: Batch Normalization Gradient
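Since the inference-time mean and variance are flagged above as "needs work", here is a minimal sketch of one common way to handle them (exponential moving averages of the batch statistics). Variable names, shapes and the `momentum` value are assumptions for illustration, not the repo's implementation.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-8):
    """Batch normalization for a layer output Z of shape (features, batch)."""
    if training:
        mu = Z.mean(axis=1, keepdims=True)                  # per-feature batch mean
        var = Z.var(axis=1, keepdims=True)                  # per-feature batch variance
        # Track running statistics so inference does not depend on the mini-batch.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var                 # fixed statistics at inference
    Z_hat = (Z - mu) / np.sqrt(var + eps)                   # normalize
    out = gamma * Z_hat + beta                              # learned scale and shift
    return out, running_mean, running_var
```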
-
Regularization:
- Techniques to prevent the neural network from overfitting the training data.
-
L2 regularization:
- Forces the weights to take lower values, preventing individual weights from standing out with very large values.
- Penalizes the cost function with the squared values of the weights.
- Preventing the weights from taking large values helps the network generalize across features, not giving too much importance to some features compared to others:
- When a weight is updated, a weight-decay term is introduced by the gradient of the cost function (coming from the squared-weight term added to the cost). See the sketch after this section.
-
L1 regularization:
- Penalizes the weights with their absolute values.
- Same intention as L2, but with a lower impact.
-
Dropout:
- Randomly drops activation outputs across the network, only during training and with a given probability.
- The objective is to balance the weight values, preventing them from fitting only the training-set samples rather than the general case.
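A minimal sketch of the two mechanisms above: the L2 penalty added to the cost (and the weight-decay term its gradient adds to the update), and inverted dropout applied to a layer's activations during training. The helper names, `lambd` and `keep_prob` are illustrative assumptions, not the repo's code.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Extra cost term: (lambda / 2m) * sum of squared weights over all layers."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def l2_weight_gradient(dW, W, lambd, m):
    """The gradient of the penalty adds a weight-decay term to the usual gradient."""
    return dW + (lambd / m) * W

def dropout_forward(A, keep_prob=0.8, training=True):
    """Inverted dropout: kept activations are rescaled so their expected value
    is unchanged, so nothing special is needed at inference time."""
    if not training:
        return A
    mask = np.random.rand(*A.shape) < keep_prob   # keep each unit with probability keep_prob
    return (A * mask) / keep_prob
```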
-
Momentum:
- Uses previous gradients (cost function slopes) to help get past possible local minima and flat valleys.
- Computes an exponentially weighted average of the previous and current gradients, as in the sketch below.
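A minimal sketch of the momentum update (one common formulation; the decay factor `beta` and the `(1 - beta)` scaling are illustrative choices).

```python
def momentum_update(W, dW, velocity, lr=0.01, beta=0.9):
    """Exponentially weighted average of past gradients drives the update."""
    velocity = beta * velocity + (1 - beta) * dW   # mix previous direction with current gradient
    W = W - lr * velocity
    return W, velocity
```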
-
Nesterov accelerated gradient:
- TODO
-
Adagrad:
- Normalizes the weight updates by dividing by the square root of the accumulated sum of squared past gradients.
- Parameters that have received larger gradients effectively get smaller updates.
- The monotonically decreasing effective learning rate proves to be too aggressive, and learning stops too early.
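A minimal sketch of the Adagrad update; the accumulator name and `eps` are illustrative.

```python
import numpy as np

def adagrad_update(W, dW, grad_sq_sum, lr=0.01, eps=1e-8):
    """Accumulate squared gradients; parameters with large past gradients get smaller steps."""
    grad_sq_sum = grad_sq_sum + dW ** 2              # monotonically growing accumulator
    W = W - lr * dW / (np.sqrt(grad_sq_sum) + eps)   # per-parameter scaled step
    return W, grad_sq_sum
```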
-
RMSprop:
- Adjusts the aggressiveness of Adagrad's monotonically decreasing effective learning rate by replacing the accumulated sum of squared gradients with an exponentially decaying average.
- Unlike in Adagrad, the updates do not get monotonically smaller.
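A minimal sketch of the RMSprop update; `beta` and `eps` are typical example values.

```python
import numpy as np

def rmsprop_update(W, dW, grad_sq_avg, lr=0.001, beta=0.9, eps=1e-8):
    """Decaying average of squared gradients instead of Adagrad's full sum."""
    grad_sq_avg = beta * grad_sq_avg + (1 - beta) * dW ** 2
    W = W - lr * dW / (np.sqrt(grad_sq_avg) + eps)
    return W, grad_sq_avg
```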
-
Adadelta:
- Based on Adagrad; it addresses Adagrad's problem of the updates shrinking towards zero over time.
- Based on two main ideas:
- Scale the learning rate using a gradient history restricted to a window of recent updates (implemented as an exponentially decaying average of squared gradients).
- Use a component built from the accumulated past parameter updates that serves as an acceleration term, as in momentum.
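A minimal sketch of the Adadelta update; note there is no explicit learning rate, the RMS of past updates rescales the RMS-normalized gradient. `rho` and `eps` are typical example values.

```python
import numpy as np

def adadelta_update(W, dW, grad_sq_avg, delta_sq_avg, rho=0.95, eps=1e-6):
    """Windowed (decaying) squared-gradient average plus an update-history term."""
    grad_sq_avg = rho * grad_sq_avg + (1 - rho) * dW ** 2
    delta = -np.sqrt(delta_sq_avg + eps) / np.sqrt(grad_sq_avg + eps) * dW
    delta_sq_avg = rho * delta_sq_avg + (1 - rho) * delta ** 2
    W = W + delta
    return W, grad_sq_avg, delta_sq_avg
```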
-
Adam:
- Combines Momentum and RMSprop.
- Commonly used as the default optimization algorithm.
- It also includes bias correction for the zero-initialized moment estimates, so the early updates aren't so far off.
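A minimal sketch of the Adam update with bias correction; `t` is the 1-based step counter, and the hyperparameter values are the usual defaults.

```python
import numpy as np

def adam_update(W, dW, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style first moment + RMSprop-style second moment, both bias-corrected."""
    m = beta1 * m + (1 - beta1) * dW           # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * dW ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```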
-
Adamax:
- Similar to Adam.
- Changes the learning-rate scaling factor: the squared-gradient average is replaced by a running maximum of gradient magnitudes (infinity norm), so small gradients are effectively ignored in the scaling --> This makes it more robust to noisy gradients.
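A minimal sketch of the Adamax update; the small constant in the denominator is only a guard against division by zero and is an illustrative choice.

```python
import numpy as np

def adamax_update(W, dW, m, u, t, lr=0.002, beta1=0.9, beta2=0.999):
    """Adam with the second-moment term replaced by an infinity norm (running max)."""
    m = beta1 * m + (1 - beta1) * dW
    u = np.maximum(beta2 * u, np.abs(dW))      # small gradients leave the scaling unchanged
    m_hat = m / (1 - beta1 ** t)               # bias correction is only needed for m
    W = W - lr * m_hat / (u + 1e-8)
    return W, m, u
```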
-
Nadam:
- TODO
- Learning rate = 0.002
- Epochs = 25
- Batch size = 128
- Dropout prob = 0.2
- Weight Initialization = He initialization over a normal distribution.
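A hedged Keras sketch wiring up the hyperparameters listed above. Only the learning rate, epochs, batch size, dropout probability and He-normal initialization come from this list; the layer sizes and the choice of Adam as optimizer are assumptions for illustration, not necessarily what deep_neural_network_keras uses.

```python
import tensorflow as tf

def build_mnist_mlp():
    # Hidden layer size is illustrative; only the listed hyperparameters come from above.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation="relu",
                              kernel_initializer="he_normal"),   # He init over a normal distribution
        tf.keras.layers.Dropout(0.2),                            # dropout prob = 0.2
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.002),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage with the listed settings:
# model = build_mnist_mlp()
# model.fit(x_train, y_train, epochs=25, batch_size=128, validation_split=0.1)
```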