This project attempts to implement the paper "Searching for Activation Functions" (Ramachandran, Zoph & Le, 2017). Although neural networks are powerful and flexible models, they are still hard to design, and their design is limited by human creativity. Using a combination of exhaustive and reinforcement-learning-based search, the paper claims to discover multiple novel activation functions. We tried to verify these claims by replicating the original study. However, we were unable to obtain good results, most likely because we lacked the massive computing resources used in the original experiments (800 Titan X GPUs).
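The search space in the paper composes unary and binary primitives into candidates of the form binary(unary1(x), unary2(x)); Swish, x * sigmoid(beta * x), is one such composition. The snippet below is a rough, self-contained sketch of the exhaustive half of such a search; the primitive lists and the scoring function are simplified stand-ins, not the code used in this repo.

```python
import itertools
import numpy as np

# Toy illustration of exhaustively composing candidate activations of the
# form binary(unary1(x), unary2(x)). Primitives and scoring are stand-ins.
unary = {
    "x":       lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh":    np.tanh,
    "relu":    lambda x: np.maximum(x, 0.0),
}
binary = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": np.maximum,
}

def score(f, xs=np.linspace(-5.0, 5.0, 101)):
    """Placeholder fitness. In the paper this would be the validation
    accuracy of a small child network trained with the candidate."""
    return -float(np.mean((f(xs) - np.maximum(xs, 0.0)) ** 2))

candidates = []
for (bn, bf), (u1, f1), (u2, f2) in itertools.product(
        binary.items(), unary.items(), unary.items()):
    cand = lambda x, bf=bf, f1=f1, f2=f2: bf(f1(x), f2(x))
    candidates.append((score(cand), "{}({}(x), {}(x))".format(bn, u1, u2)))

print(sorted(candidates, reverse=True)[:3])  # best few candidate formulas
```

In the paper, exhaustive enumeration is only used for small search spaces; larger spaces are explored with an RNN controller trained with reinforcement learning, where each candidate's reward is the validation accuracy of a child network trained with it.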
To run this project you will need:
- Anaconda3
- TensorFlow-GPU >= 1.4
If you do not have the right dependencies, you can use the Docker image we used to run these experiments:
```bash
docker pull etheleon/dotfiles
docker run --runtime=nvidia -it etheleon/dotfiles
```
First clone the repo, then navigate into the `src` folder, where the code for this project is stored:
```bash
git clone https://github.com/Neoanarika/Searching-for-activation-functions.git
cd Searching-for-activation-functions
cd src
```
Download the CIFAR-10 data first, then run the search for activation functions:
```bash
python cifar10_download_and_extract.py
python main.py
```
Next, train and test CIFAR-100 models with the newly generated activation functions (a sketch of how a candidate activation plugs into a model follows these commands):
```bash
python cifar100_download_and_extract.py
python cifar100_train.py
python cifar100_test.py
```
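As an illustration of what using a candidate activation looks like, here is a minimal TensorFlow 1.x sketch that plugs a custom activation into a small convolutional model. The candidate function, layer sizes and layout are placeholder assumptions for this sketch, not the architecture used by `cifar100_train.py`.

```python
import tensorflow as tf

def candidate_activation(x):
    """Illustrative candidate of the form max(x, sigmoid(x)), built from the
    kind of unary/binary primitives the search composes; not necessarily
    what main.py produces."""
    return tf.maximum(x, tf.nn.sigmoid(x))

# Placeholder CIFAR-100 input batch (32x32 RGB images) and a tiny model.
# Any tf.layers layer accepts a callable `activation`, so a discovered
# function can be dropped in without changing the rest of the model.
images = tf.placeholder(tf.float32, [None, 32, 32, 3])
conv = tf.layers.conv2d(images, filters=64, kernel_size=3,
                        padding="same", activation=candidate_activation)
pooled = tf.layers.max_pooling2d(conv, pool_size=2, strides=2)
flat = tf.reshape(pooled, [-1, 16 * 16 * 64])
logits = tf.layers.dense(flat, units=100)  # CIFAR-100 has 100 classes
```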
Alternatively, you can open the Jupyter notebook in the repo and run the experiments from there.
The search returned the following activation functions:

| Activation functions |
| --- |
| 3x |
| 1 |
| -3 |
Clearly we are doing something wrong. The problem with reimplementing papers like this is that when it does not work, it is hard to tell whether we simply did not run the search for long enough or whether there is a bug in our program, unknown to us, that is causing the negative result.
We also implemented Swish, the activation function found and discussed in the original paper:
```bash
python swish.py
```
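For reference, Swish is defined as f(x) = x * sigmoid(beta * x), with beta = 1 in the simplest case. A minimal TensorFlow 1.x definition (independent of the `swish.py` script in this repo) looks like this:

```python
import tensorflow as tf

def swish(x, beta=1.0):
    """Swish activation from the paper: f(x) = x * sigmoid(beta * x)."""
    return x * tf.nn.sigmoid(beta * x)
```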
We found a few things. First, during the initial phase of training the loss sometimes stays roughly constant on average, which suggests that Swish suffers from poor initialisation, at least when the weights are drawn from a normal distribution with a standard deviation of 0.1. We tried various initialisations but saw no improvement. Finally, changing the optimiser from SGD to RMSProp solved the problem. The diagram above is from training with RMSProp.
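A sketch of the initialisation and optimiser swap described above, in TensorFlow 1.x. The toy model, layer sizes and learning rates are placeholders for this sketch, not values taken from `swish.py`.

```python
import tensorflow as tf

def swish(x, beta=1.0):
    return x * tf.nn.sigmoid(beta * x)

# Toy model with weights drawn from a normal distribution, stddev = 0.1,
# the initialisation discussed above.
init = tf.truncated_normal_initializer(stddev=0.1)
x = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
w1 = tf.get_variable("w1", [784, 128], initializer=init)
w2 = tf.get_variable("w2", [128, 10], initializer=init)
hidden = swish(tf.matmul(x, w1))
logits = tf.matmul(hidden, w2)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                   logits=logits))

# Plain SGD stalled during the early phase of training for us:
# train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Switching to RMSProp resolved the stalled-loss behaviour:
train_op = tf.train.RMSPropOptimizer(0.001).minimize(loss)
```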
Swish appears to have a sharp global minimum, especially when compared with ReLU, which may account for the high variance of the gradient updates: the model might get stuck in the wedge on its way to the global minimum. Learning-rate decay might therefore help when training models that use Swish. Furthermore, a sharper minimum corresponds to poorer generalisation, which might explain why Swish performs slightly worse than ReLU in practice.
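One way to add the learning-rate decay suggested above in TensorFlow 1.x; the schedule values are illustrative assumptions, not settings used in this repo.

```python
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# Exponential decay: start at 0.001 and multiply by 0.96 every 1000 steps.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.001, global_step=global_step,
    decay_steps=1000, decay_rate=0.96, staircase=True)

# Reuse the decayed rate with RMSProp; `loss` is the model's loss tensor.
# train_op = tf.train.RMSPropOptimizer(learning_rate).minimize(
#     loss, global_step=global_step)
```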
```bibtex
@article{DBLP:journals/corr/abs-1710-05941,
  author        = {Prajit Ramachandran and
                   Barret Zoph and
                   Quoc V. Le},
  title         = {Searching for Activation Functions},
  journal       = {CoRR},
  volume        = {abs/1710.05941},
  year          = {2017},
  url           = {http://arxiv.org/abs/1710.05941},
  archivePrefix = {arXiv},
  eprint        = {1710.05941},
  timestamp     = {Wed, 01 Nov 2017 19:05:42 +0100},
  biburl        = {http://dblp.org/rec/bib/journals/corr/abs-1710-05941},
  bibsource     = {dblp computer science bibliography, http://dblp.org}
}
```