githubharald/SimpleHTR

Consider switching from RMSPropOptimizer to AdamOptimizer

Chazzz opened this issue · 5 comments

I've been consistently getting 68-69% word accuracy using the AdamOptimizer. I like that Adam improves accuracy fairly consistently, whereas the jitter present in RMSProp makes the program more likely to terminate before reaching 68% or higher. I measured a ~25% per-epoch time penalty when using Adam, and it generally takes more epochs to reach its higher accuracy (a good problem to have).

I also experimented with various batch sizes with no meaningful improvement, though Adam with a default learning rate tends to do better with larger batch sizes.

Results:

AdamOptimizer (tuned), batch size 50, with decaying learning rate:
rate = 0.001 if self.batchesTrained < 10000 else 0.0001 # decay learning rate

Run 1, end result (epoch 68):
Character error rate: 13.104371%. Word accuracy: 69.008696%.
Character error rate: 13.082070%. Word accuracy: 69.026087%. (best)

Run 2, end result (epoch 46):
Character error rate: 13.577769%. Word accuracy: 68.295652%.
Character error rate: 13.600071%. Word accuracy: 68.452174%. (best)

Run 3, end result (epoch 55):
Character error rate: 13.198626%. Word accuracy: 68.782609%.
Character error rate: 12.984522%. Word accuracy: 69.165217%. (best)
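
For readers following along, the change being proposed is essentially a one-line swap in the training graph plus the learning-rate schedule quoted above. The sketch below is a minimal, self-contained illustration using the TF 1.x API that SimpleHTR is built on; the tiny regression graph and the variable names are stand-ins for the project's actual model code, not taken from it.

```python
import tensorflow as tf  # TF 1.x graph-mode API, as used by SimpleHTR

# Tiny stand-in graph (illustrative only); in SimpleHTR the loss is the CTC loss
# of the CNN/RNN text-recognition model.
x = tf.placeholder(tf.float32, shape=[None, 10])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# The learning rate is fed per batch so it can be decayed during training.
learning_rate = tf.placeholder(tf.float32, shape=[])

# Original optimizer:
# train_op = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)
# Proposed swap to Adam, leaving its other defaults (beta1, beta2, epsilon) unchanged:
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

# Decay schedule from the runs above; batches_trained counts batches processed so far.
def decayed_rate(batches_trained):
    return 0.001 if batches_trained < 10000 else 0.0001
```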

Thank you for your research. Some questions:

  1. Did you use the decaying learning rate or just the constant default value (0.001) provided with the TF Adam implementation?
  2. Any other changes to the default values of the Adam optimizer?
  3. Did I understand this correctly: one epoch takes ~25% more time, and it also takes a larger number of epochs to train the model? Approximately how much does the overall training time increase?
  4. Just to be sure - you're using the default decoder (best path / greedy)?

  1. Decaying the learning rate over time improved word accuracy from 66% to 69%. I used rate = 0.001 if self.batchesTrained < 10000 else 0.0001
  2. I didn't modify the other Adam parameters.
  3. As you can see from my results, even with identical parameters the training time can vary by 30-50% from run to run. I believe the per-epoch improvement of the two optimizers is comparable, but the AdamOptimizer takes more epochs on average, mostly because it stops prematurely less often. So a 25% increase in overall training time is probably a reasonable estimate, though I'm not claiming much precision here. Do you have general performance/training-time data for the baseline build?
  4. I didn't modify the decoder, so I believe I'm using the default.
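
On point 4: the default decoder referred to here is CTC best-path (greedy) decoding. Below is a minimal sketch of what that corresponds to in the TF 1.x API, with illustrative tensor shapes rather than SimpleHTR's exact ones.

```python
import tensorflow as tf  # TF 1.x

# Illustrative shapes: the RNN output is time-major [max_time, batch_size, num_classes],
# with the CTC blank as the extra last class; seq_len gives each sample's length.
rnn_output = tf.placeholder(tf.float32, shape=[32, None, 80])
seq_len = tf.placeholder(tf.int32, shape=[None])

# Best path / greedy decoding: take the most likely class at every time step,
# then collapse repeated characters and remove blanks.
decoded, log_prob = tf.nn.ctc_greedy_decoder(rnn_output, seq_len)

# The usual alternative is beam search decoding, which is slower but can be more accurate:
# decoded, log_prob = tf.nn.ctc_beam_search_decoder(rnn_output, seq_len, beam_width=50)
```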

I think I'll stay with the RMSProp optimizer, because one of my goals for the SimpleHTR model is that it remains trainable on a CPU in a reasonable amount of time, and I don't want to increase the training time any further.
However, I'll add some information to the README and point to this issue, so that other users know about your findings.

Added your findings to the README.

This is a reasonable compromise.