Training with Rational Activations on very deep ResNets.
Hi,
I am using your PyTorch implementation to train a Rational ResNet-164 on CIFAR-10. The model behaves well for ResNets with 18-38 layers, but I cannot get very deep ResNets to train without dramatically lowering the learning rate, and even then the loss diverges.
Here is one example with --lr 1e-6 --wd 1e-5:
Train Epoch: 0 [0/47500 (0%)] Loss: 2.517
Train Epoch: 0 [1920/47500 (4%)] Loss: nan
While I understand that a model with rational activations represents a rational function of degree 3^layers, it isn't clear to me how deeper models should be trained.
Could you provide some help?
Thanks for your interest in our work. We haven't tried training very deep rational networks, so my intuition is limited here. It is possible that the weight initialization has a bad effect on the rational layers as the depth increases. One potential remedy would be to fine-tune a pretrained ReLU ResNet by replacing the activation functions with rationals and training only the rational functions.
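A minimal sketch of that fine-tuning idea, in case it helps. `SimpleRational` here is a hypothetical stand-in, not the activation class from the repository, and its type (3, 2) shape and placeholder coefficients are assumptions. The toy `nn.Sequential` stands in for a pretrained ReLU ResNet. The sketch only shows the two mechanical steps: swapping `nn.ReLU` modules and freezing everything except the rational coefficients.

```python
import torch
import torch.nn as nn


class SimpleRational(nn.Module):
    """Hypothetical stand-in for a rational activation P(x)/Q(x) of type (3, 2)."""

    def __init__(self):
        super().__init__()
        # Coefficients in increasing-degree order. These are placeholder values
        # (P(x) = x, Q(x) = 1), not a ReLU-approximating initialization.
        self.numerator = nn.Parameter(torch.tensor([0.0, 1.0, 0.0, 0.0]))
        self.denominator = nn.Parameter(torch.tensor([1.0, 0.0, 0.0]))

    def forward(self, x):
        p = sum(c * x**i for i, c in enumerate(self.numerator))
        q = sum(c * x**i for i, c in enumerate(self.denominator))
        return p / q


def swap_relu_for_rational(module):
    """Recursively replace every nn.ReLU submodule with a fresh SimpleRational."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.ReLU):
            setattr(module, name, SimpleRational())
        else:
            swap_relu_for_rational(child)
    return module


def freeze_all_but_rationals(model):
    """Freeze every parameter, then re-enable only the rational coefficients."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, SimpleRational):
            for p in m.parameters():
                p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]


# Toy stand-in for a pretrained ReLU network (assumption for illustration).
toy = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Sequential(nn.Linear(8, 8), nn.ReLU()),
    nn.Linear(8, 2),
)
swap_relu_for_rational(toy)
trainable = freeze_all_but_rationals(toy)
optimizer = torch.optim.SGD(trainable, lr=1e-3)  # only rational coefficients are updated
```

Two caveats: this only catches activations registered as `nn.ReLU` modules, so calls to `torch.nn.functional.relu` inside a `forward` would need to be changed by hand, and a ResNet block that reuses one `nn.ReLU` instance in several places would end up sharing a single rational there.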
I'm curious to see why the loss becomes nan in your example. Perhaps you could plot the different rational functions (there should be approximately one function per layer) to see whether one of them becomes singular (with a simple pole) and which layer is affected.
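As a complement to plotting, the poles can also be checked numerically from the denominator coefficients. The sketch below assumes each rational has a denominator of degree at most 2, Q(x) = q0 + q1*x + q2*x^2. The `layer_denominators` dictionary and its values are made up for illustration, and how to read the coefficients out of the trained model is left open.

```python
import math


def real_poles(q0, q1, q2, tol=1e-12):
    """Real roots of Q(x) = q0 + q1*x + q2*x**2, i.e. candidate poles of P/Q."""
    if abs(q2) < tol:  # denominator is at most linear
        if abs(q1) < tol:
            return []  # constant denominator: no roots (assuming q0 != 0)
        return [-q0 / q1]
    disc = q1 * q1 - 4.0 * q2 * q0
    if disc < 0:
        return []  # complex-conjugate roots: nothing on the real line
    root = math.sqrt(disc)
    return sorted({(-q1 - root) / (2 * q2), (-q1 + root) / (2 * q2)})


# Hypothetical per-layer denominator coefficients (q0, q1, q2).
layer_denominators = {
    "layer1": (1.0, 0.0, 1.0),   # 1 + x^2: no real pole
    "layer2": (-1.0, 0.0, 1.0),  # x^2 - 1: poles at -1 and +1
}

for name, coeffs in layer_denominators.items():
    poles = real_poles(*coeffs)
    status = f"real poles at {poles}" if poles else "no real poles"
    print(f"{name}: {status}")
```

A reported root is only a candidate: if the numerator vanishes at the same point the singularity cancels, and a pole far outside the range of the layer's pre-activations may never be hit in practice.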
Finally, and depending on the result of the above suggestion, there could be some numerical instabilities due to having an overall rational network of very large degree (3^164). I guess one could use rational functions for the first few layers (e.g. the first 18-38 layers, as in your experiments, to benefit from the extra approximation power) and then use ReLU for the rest of the network.
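To give a sense of scale, here is a small sketch of how the degree bound grows with the number of rational layers. It assumes each rational activation has degree 3 and that composing n of them gives degree at most 3^n. `degree_bound` is a hypothetical helper name.

```python
def degree_bound(n_rational_layers, per_layer_degree=3):
    """Upper bound on the degree of n composed rational activations."""
    return per_layer_degree ** n_rational_layers


for depth in (18, 38, 164):
    bound = degree_bound(depth)
    # len(str(bound)) - 1 is the order of magnitude of the bound.
    print(f"{depth:>3} rational layers: degree <= 3^{depth}, about 10^{len(str(bound)) - 1}")
```

Going from 18 to 164 rational layers moves the bound from roughly 10^8 to roughly 10^78, which is why limiting the rationals to the first few layers could be worth trying.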