gmftbyGMFTBY/MultiTurnDialogZoo

training epochs

katherinelyx opened this issue

Hi. First, thank you for your amazing work. Your code inspires me a lot and makes my reproduction process much more efficient.
I have a question about the training epochs. In the raw files, you set the number of training epochs to 100. I wonder whether this is appropriate for all the implemented models?

Hi, that's a good question.

In this repo, we want to compare all the baselines as fairly as possible, so the number of training epochs is set to 100 for all of them.

It is possible to change this hyperparameter to improve the baselines further, but that is not my focus.

If you like this work, you can star it and fork it for further editing.

Hi, do you mean the early-stopping mechanism for avoiding overfitting?

If so, you can find the patience hyperparameter in train.py. I seem to have commented out the early-stopping mechanism, but you can restore it.
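
For reference, patience-based early stopping usually looks like the sketch below. The helper names (`train_one_epoch`, `evaluate`) and the checkpoint path are assumptions; the actual code in train.py may differ:

```python
import torch

# Sketch of patience-based early stopping. `train_one_epoch` and `evaluate`
# stand in for the real training/validation helpers in train.py and may differ.
def run_training(model, train_loader, valid_loader, epochs=100, patience=5):
    best_val_loss, counter = float("inf"), 0
    for epoch in range(epochs):
        train_one_epoch(model, train_loader)      # assumed training helper
        val_loss = evaluate(model, valid_loader)  # assumed validation helper
        if val_loss < best_val_loss:
            best_val_loss, counter = val_loss, 0
            torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
        else:
            counter += 1
            if counter >= patience:
                print(f"No improvement for {patience} epochs, stopping early.")
                break
```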

Hi. I just tried your code and found it quite practical. However, I noticed that in the processed PersonaChat data, 'src' contains not only the conversation but also the persona information. For the input of each model, do you feed the persona together with the conversation?

Thank you.

Yes, I use the preprocessed data from this link, in which the researchers have already combined the persona information with the conversation context (each persona sentence is treated as a sentence in the dialogue history), and all the models use the persona information.
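
Schematically, the combination looks like the sketch below (a hypothetical example; the separator token and the exact preprocessing script may differ):

```python
# Hypothetical example of merging persona sentences into the dialogue
# history; each persona sentence becomes one more sentence in 'src'.
persona = ["i like to ski .", "my favorite season is winter ."]
history = ["hi , how are you ?", "great , just got back from the slopes ."]

src = persona + history          # persona sentences lead the dialogue history
print(" __eou__ ".join(src))     # "__eou__" is an assumed turn separator
```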

Thank you. I trained all the listed models on the EmpChat dataset for 20 epochs and conducted the evaluation in terms of BLEU and Distinct. I used two kinds of evaluation code, from your repo and from other repos. I find that the Distinct values are much lower than expected: the Distinct-1 scores are all 0.00XX, and the Distinct-2 scores are 0.0XX. I wonder whether 20 training epochs are not enough?
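
For context, Distinct-n is typically computed as the ratio of unique n-grams to the total number of n-grams over all generated responses; a minimal sketch (the evaluation scripts mentioned above may differ in tokenization and detail):

```python
# Minimal sketch of corpus-level Distinct-n: the ratio of unique n-grams
# to total n-grams over all generated responses.
def distinct_n(responses, n):
    ngrams = []
    for tokens in responses:
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

responses = [s.split() for s in ["i am fine thanks", "i am not sure"]]
print(distinct_n(responses, 1))  # 0.75
print(distinct_n(responses, 2))  # 0.833...
```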

Hi, I think 20 epochs is too few. You can try a larger setting such as 100 or 200.
In my experiments, I ran 100 epochs on EmpChat and the best Distinct scores reached 0.95 / 6.02 (%).
My code uses learning rate decay; if you want to achieve better performance, make sure to delete the learning rate decay code.
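
For reference, learning rate decay in PyTorch is usually implemented with a scheduler; the sketch below uses StepLR as an assumed example (the exact scheduler and parameters in this repo may differ):

```python
import torch

# Sketch of learning rate decay with a PyTorch scheduler; the scheduler
# type and its parameters are assumptions. Removing scheduler.step()
# (or the scheduler itself) disables the decay.
model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    # ... run one training epoch here (forward, backward, optimizer.step) ...
    scheduler.step()  # halves the learning rate every 20 epochs
```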

Wow, 100 epochs is quite a lot.
I am working on a paper about multi-turn conversation generation, so I would really like some advice from you about the training process of the baseline models.
In my previous experiments, I did find that more training epochs may bring higher Distinct scores, while the BLEU scores may decline. I once implemented HRED in TensorFlow on the DailyDialog dataset, and after about 10 epochs of training, the trade-off between BLEU and Distinct became increasingly obvious.
What do you think about the comparisons between different models?
Should we fine-tune each model to obtain its best performance, or just use the same number of training epochs for all of them?

Sorry about the late response; I have been busy writing papers recently.
As for your questions, my answer is yes. Training for more epochs leads to a higher Distinct metric, while the BLEU metric tends to stay stable. I think this phenomenon is caused by inadequate training: in this repo, the Distinct metric achieves higher scores with more epochs, but in my experiments it plateaus at around 60 or 70 epochs. As for the BLEU metric, I think the reason is that BLEU itself is not suitable for evaluating open-domain dialogue systems, so the BLEU scores may be biased. I recommend using better metrics for evaluation, such as BERTScore and RUBER (I have already released the code for RUBER and BERT-RUBER; you can find it on my GitHub homepage).
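
For example, BERTScore can be computed with the bert-score package; a minimal sketch with made-up sentences:

```python
# Minimal sketch of scoring generated responses with BERTScore,
# using the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["i am doing great , thanks for asking ."]
references = ["i am fine , thank you ."]

# score() returns precision, recall, and F1 tensors, one value per pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```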

Although 100 epochs may seem like a lot, it actually does not take much time to train the models in this repo. Training HRED for one epoch only costs me about 10 to 15 minutes, so 100 epochs will only cost you about one day. However, the hierarchical models with a word-level attention mechanism take about twice as long to train because of the inefficiency of word-level attention, which is still an open problem.

In this repo, the same number of epochs is used for the sake of consistent comparison. You can try other settings and use better automatic evaluation to test whether a model is overfitting.