
In this project we study the impact of various weight initializations in neural networks on the MNIST dataset.


Analyzing-Weight-Initialization-in-Neural-Networks

Every deep learning book and tutorial points out that weight initialization is an important design choice when developing neural network models. The initialization step can be critical to the model's ultimate performance, and choosing the right method matters.
In this repo, we show the impact of various weight initializations on the accuracy of our model. We train the model on the MNIST dataset.
We consider 7 types of weight initializations (a short Keras sketch of how they can be specified follows the list):

  1. Constant: All weights are initialized to 0.
  2. GlorotNormal/ Xavier normal: Values of the weights are sampled from a truncated normal distribution centred on 0 with stddev = sqrt(2 / (fan_in + fan_out)) where fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor.
  3. HeNormal: Values of the weights are sampled from a truncated normal distribution centred on 0 with stddev = sqrt(2 / fan_in) where fan_in is the number of input units in the weight tensor.
  4. Standard Normal Distribution: Values of the weights are sampled from a normal distribution centred on 0 with stddev = 1.
  5. GlorotUniform/ Xavier Uniform: Values of the weights are sampled from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
  6. HeUniform: Values of the weights are sampled from a uniform distribution within [-limit, limit], where limit = sqrt(6 / fan_in) (fan_in is the number of input units in the weight tensor).
  7. Uniform(0,1): Values of the weights are sampled from a uniform distribution within [0, 1].
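
Below is a minimal sketch of how these seven initializers can be specified in Keras. The model architecture, layer sizes, and dictionary/function names (`INITIALIZERS`, `build_model`) are illustrative assumptions, not necessarily the exact settings used in this repo.

```python
# Illustrative sketch: the seven initializers expressed as Keras initializer objects,
# plus a small dense MNIST classifier whose kernels use a given initializer.
import tensorflow as tf

INITIALIZERS = {
    "Constant":       tf.keras.initializers.Constant(0.0),
    "GlorotNormal":   tf.keras.initializers.GlorotNormal(),
    "HeNormal":       tf.keras.initializers.HeNormal(),
    "StandardNormal": tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0),
    "GlorotUniform":  tf.keras.initializers.GlorotUniform(),
    "HeUniform":      tf.keras.initializers.HeUniform(),
    "Uniform01":      tf.keras.initializers.RandomUniform(minval=0.0, maxval=1.0),
}

def build_model(initializer):
    """Return a simple dense classifier whose weights use the given initializer."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu", kernel_initializer=initializer),
        tf.keras.layers.Dense(10, activation="softmax", kernel_initializer=initializer),
    ])
```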

Visualization

The model is evaluated using five-fold cross-validation. We plot the accuracies of the 5 folds for each of the 7 initializations listed above (a sketch of the evaluation loop is shown below).
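
A minimal sketch of this evaluation loop, assuming scikit-learn's `KFold` and the `build_model` helper and `INITIALIZERS` dictionary from the sketch above; the optimizer, epoch count, and batch size are illustrative assumptions:

```python
# Illustrative sketch: 5-fold cross-validation of the model on MNIST for each initializer.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x / 255.0  # scale pixel values to [0, 1]

results = {}
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for name, init in INITIALIZERS.items():
    fold_accuracies = []
    for train_idx, val_idx in kfold.split(x):
        model = build_model(init)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x[train_idx], y[train_idx], epochs=5, batch_size=128, verbose=0)
        _, acc = model.evaluate(x[val_idx], y[val_idx], verbose=0)
        fold_accuracies.append(acc)
    results[name] = fold_accuracies
    print(f"{name}: mean accuracy = {np.mean(fold_accuracies):.4f}")
```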

Next, we observe the average accuracy for each of these weight initializers.
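
A minimal sketch of summarizing the averages, assuming matplotlib and the `results` dictionary produced by the cross-validation sketch above:

```python
# Illustrative sketch: bar chart of the mean 5-fold accuracy per initializer.
import numpy as np
import matplotlib.pyplot as plt

names = list(results.keys())
means = [np.mean(results[name]) for name in names]

plt.figure(figsize=(10, 4))
plt.bar(names, means)
plt.ylabel("Mean 5-fold accuracy")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```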

Conclusion

We observe that for the MNIST dataset, the model performs worst with the Constant, Uniform(0,1), and Standard Normal initializations. With the remaining initializers, the model performs best and the accuracies are broadly similar. This is consistent with the literature: when all weights are initialized to the same constant (e.g. 0), every unit in a layer receives identical gradient updates, so the network cannot break symmetry and effectively stops learning.
One point to note is that this is not a generalized conclusion, i.e. the results can differ for other datasets and other sets of hyper-parameters. We have tried to show the impact of the weight initializers under a set of standard hyper-parameters for the MNIST dataset. The reader is encouraged to try these weight initializers on other datasets such as ImageNet, Iris, etc.