hikettei/cl-waffe2

examples/mnist/mlp.lisp - reset-compiled-function-cache! question

atzmueller opened this issue · 3 comments

Using the current version of mlp.lisp, on the first call of, e.g.,
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil)
the training loss in the first epoch is usually around 0.26.

For subsequent runs (evaluating (train-and-valid-mlp :epoch-num 11 :benchmark-p nil) again), the loss is larger (around 0.76 in the first epoch). I suspect this is caused by some caching in the compiler together with different initializations of the compiled structures, since if I evaluate
(cl-waffe2/vm.generic-tensor::reset-compiled-function-cache!)
before evaluating
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil),
then the loss is in the same range as for the very first run.
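
For reference, here is the workaround as a single sequence (note that reset-compiled-function-cache! is unexported, so the double colon reaches into package internals):

;; Workaround: clear the compiled-function cache before each run so that
;; every run starts from a freshly compiled state.
(cl-waffe2/vm.generic-tensor::reset-compiled-function-cache!)
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil)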

Is this the intended behavior, or should the reset be applied somewhere when the model is built/compiled?

I didn't expect such behavior, because a cache is created for each AbstractNode and the compiled code does not include any tensors. By design, users should not have to be aware of this function's existence, which is why it is not exported.
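
To make the intended invariant concrete, here is a sketch using the public API from the README (package prefixes omitted; the exact symbols may live in different packages): the code compiled for a node is cached and reused, but tensors are passed in as arguments, so repeated runs must not share state:

;; Sketch of the caching invariant: both calls reuse the cached compiled
;; code for the node, yet each sees only its own freshly created tensors.
(proceed (!add (randn `(3 3)) (randn `(3 3)))) ; compiles and caches
(proceed (!add (randn `(3 3)) (randn `(3 3)))) ; cache hit, fresh tensors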

I suspect that this is caused by some caching in the compiler and different initializations ...

I thought about that too; I suspect the current implementation of Adam has something to do with it, since if I change these lines:

;; Case 1. Adam -> SGD
;; From
(mapc (hooker x (Adam x :lr lr)) (model-parameters model))
;; To
(mapc (hooker x (SGD x :lr lr)) (model-parameters model))
;; Case 2. Deleting Adam
;; delete this line (the model is then never optimized)
(mapc #'call-optimizer! (model-parameters model))

In both cases, the loss stays in the same range for every epoch and every run.
I'm still analyzing the issue. Thank you for the bug report.
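
One plausible mechanism, sketched below in plain Common Lisp (the struct and function names are invented for illustration and are not cl-waffe2 APIs): Adam keeps per-parameter state (moment estimates and a step counter), so if that state survives in a cache across runs, the second run starts from stale moments instead of zeros, while stateless SGD is unaffected.

;; Hypothetical sketch: why cached optimizer state changes the first epoch.
;; ADAM-STATE and ADAM-STEP are invented names, not cl-waffe2 APIs.
(defstruct adam-state
  (m 0.0) (v 0.0) (n 0)) ; first moment, second moment, step count

(defun adam-step (state grad &key (lr 1e-3) (beta1 0.9) (beta2 0.999) (eps 1e-7))
  "Return the parameter delta for one Adam step, mutating STATE."
  (incf (adam-state-n state))
  (setf (adam-state-m state)
        (+ (* beta1 (adam-state-m state)) (* (- 1 beta1) grad))
        (adam-state-v state)
        (+ (* beta2 (adam-state-v state)) (* (- 1 beta2) grad grad)))
  (let* ((step  (adam-state-n state))
         (m-hat (/ (adam-state-m state) (- 1 (expt beta1 step))))
         (v-hat (/ (adam-state-v state) (- 1 (expt beta2 step)))))
    (- (/ (* lr m-hat) (+ (sqrt v-hat) eps)))))

;; If a compiled-function cache kept STATE alive between runs, the next
;; run's first updates would use stale m/v/n instead of a fresh zero state,
;; which would explain the different first-epoch loss. SGD keeps no such
;; state, matching the observation that switching to SGD hides the issue.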

This issue should be fixed by the latest PR #149, so I'm closing it.
I have had a lot on my plate this month and could not tackle this issue quickly; sorry for the delayed answer.

Yes, it works, thanks!