Rerun results lower than what's reported

Question

Closed this issue 2 years ago · 2 comments

Hello. I reran the GEC-PD experiment with the data and code provided in the repo. However, the results I got were lower than those reported in the repo.

Results reported in the repo (precision | recall | F0.5):

S0: 41.48 | 21.44 | 34.94
S1: 31.11 | 19.37 | 27.74
G0: 42.41 | 23.01 | 36.29
G1: 32.00 | 23.28 | 29.77

S avg: 36.30 | 20.40 | 31.34
G avg: 37.21 | 23.15 | 33.03

Rerun results (precision | recall | F0.5):

S0: 38.54 | 19.10 | 31.99
S1: 30.33 | 18.09 | 26.69
G0: 42.38 | 21.19 | 35.30
G1: 32.06 | 21.50 | 29.17

S avg: 34.43 | 18.60 | 29.34
G avg: 37.22 | 21.35 | 32.24

Environment:

OS: Ubuntu 18.04.1 64 bits
Python version 3.7.11
Pytorch version 1.7.1
CUDA Version 11.2

Here are several possible reasons for the performance gap:

Choice of the checkpoint used to generate predictions on the test sets and to evaluate them (calculating precision / recall / $F_{0.5}$). I used the best checkpoint from training (checkpoint_best.pt generated by fairseq). The sample code in the repo uses checkpoint3.pt instead, but why?
ERRANT version. I used errant==2.3.0.
Random seeds. I used [10, 20, 30] and took the average.
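
For reference, here is a minimal sketch of how $F_{0.5}$ follows from precision and recall, using the rows reported in the repo as a sanity check. The function name `f_beta` is my own and is not taken from the repo or from ERRANT. Recomputed values can differ from the reported ones in the last digit, for example when the reported numbers are averages over seeds.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta from precision and recall (output is on the same scale as the inputs)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rows copied from "Results reported in the repo": (precision, recall, reported F0.5)
reported = {
    "S0": (41.48, 21.44, 34.94),
    "S1": (31.11, 19.37, 27.74),
    "G0": (42.41, 23.01, 36.29),
    "G1": (32.00, 23.28, 29.77),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed {f_beta(p, r):.2f}, reported {f:.2f}")
```

With beta = 0.5, precision is weighted more heavily than recall, which is why the third column sits closer to the first than to the second.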

Since the evaluation script was not released by the repo, I am not sure how the trained models in the paper were evaluated. Could you kindly provide more details, such as releasing the evaluation script?

Thank you very much.

Answer 1 · 2022-10-14T08:37:18.000Z

Hi,

Apologies for the late reply. I think the main reason is the choice of checkpoint. We choose the best checkpoint based on its F0.5 score on the validation set, not on the loss produced during training. In our experiment, checkpoint3.pt gave the best validation F0.5 score, so you could try using checkpoint3.pt. The other factors you listed should not cause a problem.
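
As a rough sketch of that selection step (not the script we used), assume you have already scored each saved checkpoint on the validation set and collected its precision and recall. The helper names and all numbers below are placeholders for illustration only, not results from our experiments.

```python
from typing import Dict, Tuple


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta from precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_best_checkpoint(
    val_scores: Dict[str, Tuple[float, float]], beta: float = 0.5
) -> str:
    """Return the checkpoint name whose validation F_beta is highest."""
    return max(val_scores, key=lambda name: f_beta(*val_scores[name], beta))


# Hypothetical validation (precision, recall) per checkpoint; placeholder values.
val_scores = {
    "checkpoint1.pt": (30.0, 15.0),
    "checkpoint2.pt": (35.0, 18.0),
    "checkpoint3.pt": (40.0, 20.0),
    "checkpoint4.pt": (38.0, 19.0),
}

best = pick_best_checkpoint(val_scores)
print(best)
```

The checkpoint picked this way can differ from checkpoint_best.pt, which is chosen by a different criterion.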

Answer 2 · 2022-10-17T16:12:04.000Z

OK, I'll try it. Thank you ^w^