shizhouxing/DialogueDiscourseParsing

Need help to replicate the results

jasonwu0731 opened this issue · 4 comments

Hello,

I followed your instructions to download the same dataset split and ran python main.py --train with all the default parameters. Your original code appears to evaluate on the test set during training (which it is not supposed to do), so I followed your paper ("We retained 10% of the training dialogues for validation") and further split the training set into train and dev.
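
A 10% hold-out like the one described above could be sketched as follows. This is only an illustration, not the script actually used for this issue: the function name `split_train_dev`, the `seed` argument, and the choice to round the dev size are all assumptions, and it treats the training data as a plain list of dialogues.

```python
import random


def split_train_dev(dialogues, dev_fraction=0.1, seed=0):
    """Hold out a fraction of the dialogues for validation.

    Hypothetical helper: the name, the seed, and the rounding rule are
    assumptions, not taken from the repository.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be strictly between 0 and 1")

    # Shuffle indices with a fixed seed so the split is reproducible.
    indices = list(range(len(dialogues)))
    random.Random(seed).shuffle(indices)

    n_dev = round(len(dialogues) * dev_fraction)
    dev_indices = set(indices[:n_dev])

    # Preserve the original dialogue order inside each subset.
    train = [d for i, d in enumerate(dialogues) if i not in dev_indices]
    dev = [d for i, d in enumerate(dialogues) if i in dev_indices]
    return train, dev


# Stand-in data: integers in place of real dialogue objects.
train, dev = split_train_dev(list(range(1086)))
print(len(train), len(dev))
```

With 1,086 dialogues and a 10% hold-out, this rounding rule gives 977 training and 109 dev dialogues, the same sizes that appear in the log below. Which dialogues end up in each subset depends on the seed.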

Below are the results I obtained after training. I selected the checkpoint with the best dev-set performance, which is at epoch 46, and evaluated it on the test set. The results do not seem close to the numbers reported in the paper (73.2 and 55.7, respectively).

Do you have any suggestions on what I should do to replicate your results? Am I missing any essential details?

Thank you so much in advance for your reply.

Loading data: ../data/stac/stac-linguistic-2018-05-04/processed_data/train.json
977 dialogs, 10630 edus, 10297 relations, 129 backward relations
684 edus have multiple parents
Loading data: ../data/stac/stac-linguistic-2018-05-04/processed_data/dev.json
109 dialogs, 1268 edus, 1230 relations, 11 backward relations
74 edus have multiple parents
Loading data: ../data/stac/stac-linguistic-2018-05-04/processed_data/test.json
111 dialogs, 1156 edus, 1126 relations, 8 backward relations
77 edus have multiple parents
Building vocabulary...
Loading word vectors...
Pre-trained vectors: 2579/2623
Dataset sizes: 977 / 109 / 111
Reading model parameters from dev_model/checkpoint-00000046
Test:
  test loss_bi: 0.00000
  test loss_multi: 1.00065
  test f1_bi: 0.70706
  test f1_multi: 0.52737

Hi @jasonwu0731, the final results were reported by taking the epoch with the best performance on the test set, without using a validation set, as the code shows. Because this dataset is so small, I think selecting the best epoch on a development set can be unstable and may not predict test-set performance well.
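
The two checkpoint selection strategies discussed in this thread can be contrasted with a small sketch. It is not code from the repository: `select_epoch` is a hypothetical helper, and the per-epoch F1 values are made-up toy numbers, not results from any run.

```python
def select_epoch(dev_f1, test_f1):
    """Compare dev-based and test-based checkpoint selection.

    Hypothetical helper. Takes per-epoch F1 scores and returns, for each
    strategy, the chosen epoch index and the test F1 at that epoch.
    """
    if not dev_f1 or len(dev_f1) != len(test_f1):
        raise ValueError("need two non-empty score lists of equal length")

    epochs = range(len(dev_f1))
    # Strategy 1: pick the epoch with the best dev score, report its test score.
    best_dev_epoch = max(epochs, key=lambda e: dev_f1[e])
    # Strategy 2: pick the epoch with the best test score directly.
    best_test_epoch = max(epochs, key=lambda e: test_f1[e])

    return {
        "dev_selected": (best_dev_epoch, test_f1[best_dev_epoch]),
        "test_selected": (best_test_epoch, test_f1[best_test_epoch]),
    }


# Toy scores for three epochs (illustrative only).
toy_dev_f1 = [0.50, 0.62, 0.60]
toy_test_f1 = [0.48, 0.55, 0.58]
print(select_epoch(toy_dev_f1, toy_test_f1))
```

By construction, the test-selected score can never be lower than the dev-selected one, so numbers obtained under the two strategies are not directly comparable.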

Ok got it.

Does the same checkpoint selection strategy also apply to the baselines reported in the paper (Table 1)? Thanks.