carpedm20/ENAS-pytorch

Retraining from scratch yields worse results

nkcr opened this issue · 5 comments

nkcr commented

Hello,

As described in the paper (section 2.2, Deriving Architectures), I tried to retrain the best derived model from scratch, but surprisingly it gives worse results than if I keep the original (shared) weights.
I expected training the best model (dag) from scratch to be faster and to eventually reach a better perplexity, but that's not the case.

I do the following (a simplified sketch of these steps is shown after the list):

  1. Launch ENAS with the --load_path argument, which loads a previous run, and with --mode test, which calls a custom test method inside the trainer class
  2. (In the test method) I reset the shared weights with self.shared.reset_parameters()
  3. I derive the best model (dag)
  4. Then I train this model from scratch, iterating over the train set for N epochs (as in the train_shared method)
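For reference, here is a simplified sketch of the custom test method inside the trainer class. The derive and train_shared names mirror the repo's Trainer methods, but the dag argument to train_shared is my own modification (the upstream train_shared doesn't take one), so treat this as pseudocode for the steps above rather than the exact code:

```python
def test(self):
    # 1. Shared and controller weights were already restored via --load_path.

    # 2. Throw away the search-phase shared weights.
    self.shared.reset_parameters()

    # 3. Derive the best model (dag) from the trained controller.
    best_dag = self.derive()

    # 4. Retrain only that dag from scratch, iterating over the train set
    #    for N epochs, mirroring what train_shared does during the search.
    for epoch in range(self.args.max_epoch):
        self.train_shared(dag=best_dag)
```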

The following picture shows the loss and perplexity during the "normal" training (first slope) and after resetting the shared weights (second slope). The second slope trains only the same best model (dag).

[Figure: shared model loss and perplexity over training steps; the first slope is the search-phase training, the second the retraining after the reset]

Does anyone have an idea why resetting the shared weights and retraining from scratch performs so badly?

Just to make sure I'm understanding: when you say "first slope," you're referring to the part of the graph from 0 to 60k on the x-axis, and the second slope starts at 80k? So you called self.shared.reset_parameters() at 60k, correct?

I guess I'm not too surprised at this result. There's very little information in the paper about this retraining step. It's all in section 2.2, under Deriving Architectures: "We then take only the model with the highest reward to re-train from scratch" - AFAICT that's it, that's the whole description of the retraining step.

If you look at the TensorFlow ENAS implementation from the paper's authors (https://github.com/melodyguan/enas), you'll see that there are two scripts: ptb_search.sh and ptb_final.sh. The latter is used to retrain the best found dag (and in fact they've hard-coded the best found dag to be exactly the one from the paper). Comparing the two, I notice that several parameters differ: lstm_hidden_size is 720 in ptb_search.sh but 748 in ptb_final.sh, for example, and the learning-rate parameters are very different as well.

Perhaps you could try retraining using their parameter values from ptb_final.sh?
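As a rough sketch, you could clone the search-phase args and overwrite them with the ptb_final.sh values before the retraining run. Only the hidden size below comes from the numbers I quoted; shared_hid is my guess for the ENAS-pytorch flag that corresponds to lstm_hidden_size (check config.py), and the commented-out lines are placeholders for the learning-rate settings you'd copy from ptb_final.sh yourself:

```python
import copy

from config import get_args  # ENAS-pytorch's own config module

# Load the usual command-line args, then swap in the final-training values.
args, _ = get_args()
retrain_args = copy.deepcopy(args)

retrain_args.shared_hid = 748      # ptb_search.sh uses 720
# retrain_args.shared_lr = ...     # learning-rate params differ a lot between
# retrain_args.shared_decay = ...  # the two scripts; take them from ptb_final.sh
```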

nkcr commented

Yes, correct. The second part starts at around 80k.

Interesting, I will try to use the same parameters and compare the results, thanks for the suggestion.

nkcr commented

#35 provides a way to train a single, given dag.

@nkcr were you able to get better results with the parameters from the TF code?

nkcr commented

No, not really. I didn't investigate much, but after a quick matching of the TensorFlow implementation's parameters I still got worse results.