modern-fortran/neural-fortran

Implement Adam optimizer

milancurcic opened this issue · 6 comments

Proposed by @rweed in #114.

Paper: https://arxiv.org/abs/1412.6980
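
For reference, the update rule from the paper (Algorithm 1), with gradient $g_t$, step size $\alpha$, decay rates $\beta_1, \beta_2$, and a small $\epsilon$ for numerical stability:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1 - \beta_1^t), \qquad \hat v_t = v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat m_t / (\sqrt{\hat v_t} + \epsilon)
\end{aligned}
$$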

Currently, the optimizers module is only a stub, and the only available optimizer (SGD) is hardcoded in the network % train method, with the weight updates propagating all the way down to the individual concrete layer implementations. Some refactoring is needed to decouple the weight updates from the concrete layer implementations and to allow defining optimizer algorithms in their own concrete types.
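
A minimal sketch of what the decoupling could look like, assuming an abstract optimizer type with a deferred update procedure (all names here are hypothetical, not the actual neural-fortran API):

```fortran
module nf_optimizers_sketch
  ! Hypothetical sketch: concrete algorithms (SGD, Adam, ...) extend an
  ! abstract base, so layers only expose parameters and gradients and
  ! never perform the weight update themselves.
  implicit none

  private
  public :: optimizer_base, sgd

  type, abstract :: optimizer_base
    real :: learning_rate = 0.01
  contains
    procedure(minimize_interface), deferred :: minimize
  end type optimizer_base

  abstract interface
    subroutine minimize_interface(self, params, gradients)
      import :: optimizer_base
      class(optimizer_base), intent(inout) :: self
      real, intent(inout) :: params(:)
      real, intent(in) :: gradients(:)
    end subroutine minimize_interface
  end interface

  type, extends(optimizer_base) :: sgd
  contains
    procedure :: minimize => sgd_minimize
  end type sgd

contains

  subroutine sgd_minimize(self, params, gradients)
    ! Plain gradient descent step; a stateful optimizer like Adam would
    ! carry its moment estimates as components of its own type.
    class(sgd), intent(inout) :: self
    real, intent(inout) :: params(:)
    real, intent(in) :: gradients(:)
    params = params - self % learning_rate * gradients
  end subroutine sgd_minimize

end module nf_optimizers_sketch
```

The network would then pass each layer's flattened parameters and gradients to optimizer % minimize, instead of every layer implementing its own update.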

rweed commented

Milan

FYI, I found one Fortran implementation of Adam at
https://github.com/thchang/NN_MOD
Unfortunately, the comments in the code suggest it was implemented but never tested. I'm still looking for a batch normalization implementation (outside of Keras).

Also, one of my other "wants", linear layers, is trivial to implement; it took me all of about 5 minutes to do in your existing code.

milancurcic commented

Thanks for the link to NN_MOD. I'd like to work on Adam first. I think it's easier to implement than batch norm, and it will drive the much-needed refactor for optimizers in general (rather than them being hardcoded in the network % train subroutine).
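
If the refactor goes in a direction like the sketch above, Adam mostly adds per-parameter state (first and second moment estimates plus a step counter for bias correction) on top of the abstract base. A rough, hypothetical continuation of that sketch:

```fortran
! Would live alongside the optimizer_base sketch above (names hypothetical).

type, extends(optimizer_base) :: adam
  real :: beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
  real, allocatable :: m(:), v(:)  ! first and second moment estimates
  integer :: t = 0                 ! step counter for bias correction
contains
  procedure :: minimize => adam_minimize
end type adam

subroutine adam_minimize(self, params, gradients)
  class(adam), intent(inout) :: self
  real, intent(inout) :: params(:)
  real, intent(in) :: gradients(:)

  ! Allocate the moment estimates on first use.
  if (.not. allocated(self % m)) then
    allocate(self % m(size(params)), self % v(size(params)))
    self % m = 0
    self % v = 0
  end if

  self % t = self % t + 1
  self % m = self % beta1 * self % m + (1 - self % beta1) * gradients
  self % v = self % beta2 * self % v + (1 - self % beta2) * gradients**2

  ! Bias-corrected update (Algorithm 1 in the paper).
  params = params - self % learning_rate &
    * (self % m / (1 - self % beta1 ** self % t)) &
    / (sqrt(self % v / (1 - self % beta2 ** self % t)) + self % epsilon)
end subroutine adam_minimize
```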

Would you like to contribute the linear layer here as a PR? As I understand it, it's just a dense layer but without an activation. Are you just using a dense layer but with a "no-op" activation function (i.e. y = x)?
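
If it's the no-op activation route, the activation pair is just the identity and a constant derivative of one, e.g. (a sketch; the names and interface are not the repo's actual activation API):

```fortran
module nf_linear_activation_sketch
  implicit none
contains

  elemental function linear(x) result(res)
    ! Identity activation: y = x
    real, intent(in) :: x
    real :: res
    res = x
  end function linear

  elemental function linear_prime(x) result(res)
    ! Derivative of the identity is 1 everywhere
    real, intent(in) :: x
    real :: res
    res = 1
  end function linear_prime

end module nf_linear_activation_sketch
```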

milancurcic commented

@Spnetic-5, would you like to tackle this one next? I forgot whether you have a WIP implementation of Adam or AdaGrad?

Spnetic-5 commented

Yes, the Adam optimizer implementation is in progress; I'll make a PR soon.

Done by #150.