This is an implementation of the Center Loss paper (2016), trained on the MNIST dataset.
Paper: A Discriminative Feature Learning Approach for Deep Face Recognition
Link to the paper: https://ydwen.github.io/papers/WenECCV16.pdf
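For reference, the center loss keeps one learnable center per class and penalizes the squared distance between each deep feature and the center of its class; the centers themselves are updated with a moving-average rule controlled by a rate alpha. A minimal TensorFlow 1.x sketch of this idea (the function name and the exact variable handling are my own, not taken from this repo):

```python
import tensorflow as tf

def center_loss(features, labels, alpha, num_classes):
    # features: [batch, feat_dim] deep features, labels: [batch] integer class ids,
    # alpha: center update rate, num_classes: 10 for MNIST.
    feat_dim = int(features.get_shape()[1])
    centers = tf.get_variable('centers', [num_classes, feat_dim],
                              initializer=tf.zeros_initializer(), trainable=False)
    centers_batch = tf.gather(centers, labels)  # the center of each example's class
    loss = 0.5 * tf.reduce_sum(tf.square(features - centers_batch))

    # Move each used center towards the mean of the features assigned to it in this batch.
    diff = centers_batch - features
    _, unique_idx, unique_count = tf.unique_with_counts(labels)
    appear_times = tf.cast(tf.gather(unique_count, unique_idx), tf.float32)
    diff = alpha * diff / tf.reshape(appear_times + 1, [-1, 1])
    centers_update_op = tf.scatter_sub(centers, labels, diff)
    return loss, centers_update_op
```

In practice this loss is added to the softmax cross-entropy with a weight lambda, as in the paper.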
TensorFlow version: 1.5
MNIST train set: 55000 training examples
MNIST test set: 10000 test examples
Both are loaded from the TensorFlow examples module.
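Assuming the standard TF 1.x tutorial loader is meant, the split sizes above come out of the box; the exact flags used in this repo (e.g. one_hot, dtype) may differ:

```python
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('MNIST_data', one_hot=False)
print(mnist.train.num_examples)   # 55000
print(mnist.test.num_examples)    # 10000

batch_images, batch_labels = mnist.train.next_batch(100)  # images come flattened to 784 values
```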
Result after 13000 iterations (roughly 23 epochs):
The above snapshot was made from 1000 randomly chosen training examples at step 12400. More snapshots of the training steps can be found in the training_snapshots folder.
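The snapshots are scatter plots of the 2-dimensional deep features, one color per digit. A hypothetical plotting helper along these lines (name and signature are illustrative, not the repo's actual code):

```python
import matplotlib.pyplot as plt

def plot_features(features, labels, step, out_dir='training_snapshots'):
    # features: [N, 2] numpy array of deep features, labels: [N] array of digit ids.
    plt.figure(figsize=(6, 6))
    for digit in range(10):
        mask = labels == digit
        plt.scatter(features[mask, 0], features[mask, 1], s=3, label=str(digit))
    plt.legend(loc='upper right')
    plt.title('step %d' % step)
    plt.savefig('%s/step_%d.png' % (out_dir, step))
    plt.close()
```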
The network was trained with AdamOptimizer.
Crucial steps for successful training were (a sketch putting them together follows the list):
- removing the bias term from the last layer (before the cross-entropy)
- training with a learning rate of 0.0005 instead of 0.001
- rescaling the training examples from the [0, 255] range to [-1, 1]
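A sketch of how these three points might be wired together, reusing the center_loss sketch from above; build_feature_extractor, the center update rate alpha and the loss weight 0.5 are placeholders, not values taken from this repo:

```python
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 784])   # raw pixels in [0, 255]
labels = tf.placeholder(tf.int64, [None])

x = images / 127.5 - 1.0                           # rescale [0, 255] -> [-1, 1]
features = build_feature_extractor(x)              # hypothetical net producing 2-D features
logits = tf.layers.dense(features, 10, use_bias=False)   # last layer without a bias term

softmax_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
c_loss, centers_update_op = center_loss(features, labels, alpha=0.5, num_classes=10)
total_loss = softmax_loss + 0.5 * c_loss

with tf.control_dependencies([centers_update_op]):
    train_op = tf.train.AdamOptimizer(0.0005).minimize(total_loss)
```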
I ran this code on a CPU, which is why only 1000 images were used for the plots. I tried plotting with 10000 examples, but I did not have enough memory.
The code contains the same nn.py file as the weight-normalization code. However, I do not use weight normalization, mean-only batch normalization, or the data-dependent initialization for training this code; they are all set to false. The only function parameters I use when creating the model template are use_bias and use_xavier_initialization. For the last layer, use_bias is set to false, and during training use_xavier_initialization is set to true (see model.py).
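For illustration, a dense layer exposing such flags could look like the following; the real nn.py may differ:

```python
import tensorflow as tf

def dense(x, num_units, name, use_bias=True, use_xavier_initialization=False):
    # Hypothetical layer matching the flags described above; not the repo's actual code.
    with tf.variable_scope(name):
        in_dim = int(x.get_shape()[1])
        if use_xavier_initialization:
            initializer = tf.contrib.layers.xavier_initializer()
        else:
            initializer = tf.random_normal_initializer(stddev=0.05)
        W = tf.get_variable('W', [in_dim, num_units], initializer=initializer)
        out = tf.matmul(x, W)
        if use_bias:
            b = tf.get_variable('b', [num_units], initializer=tf.zeros_initializer())
            out = out + b
        return out
```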
Implementations and links that I found useful:
- https://github.com/EncodeTS/TensorFlow_Center_Loss
- https://github.com/pangyupo/mxnet_center_loss
- https://github.com/ydwen/caffe-face (the authors' implementation in C++)
Notes: The code uses neither weight normalization, nor mean-only batch normalization, nor data-dependent initialization. I should have tried these out to see whether they accelerate the training. I think the data-dependent initialization in particular would be very useful here (avoiding dead clusters), so any of these normalization techniques could help to speed up the training.
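As an example of what weight normalization would look like in a dense layer (Salimans & Kingma, 2016), independent of this repo's nn.py:

```python
import tensorflow as tf

def weight_norm_dense(x, num_units, name):
    # Sketch of a weight-normalized dense layer; not the nn.py implementation used here.
    with tf.variable_scope(name):
        in_dim = int(x.get_shape()[1])
        V = tf.get_variable('V', [in_dim, num_units],
                            initializer=tf.random_normal_initializer(stddev=0.05))
        g = tf.get_variable('g', [num_units], initializer=tf.ones_initializer())
        b = tf.get_variable('b', [num_units], initializer=tf.zeros_initializer())
        W = V * (g / tf.sqrt(tf.reduce_sum(tf.square(V), axis=0)))  # w = g * v / ||v|| per column
        return tf.matmul(x, W) + b
```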
About the training loss (step: loss value):
- step 200: 2.3
- step 5000: 1.2
- step 10000: 0.7
- step 14200: 0.4
I assume the learning rate should have been halved at around step 5000-6000, because the loss had plateaued there.
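Such a halving could be expressed with a piecewise-constant schedule, reusing total_loss from the training sketch above; the boundary at step 5500 and the base rate 0.0005 mirror the notes in this README, everything else is illustrative:

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = tf.train.piecewise_constant(global_step,
                                            boundaries=[5500],
                                            values=[0.0005, 0.00025])
train_op = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)
```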