Reproducibility question

Question

Reproducibility question

944721805 opened this issue 5 years ago · 3 comments

Hi, I found that I can't reproduce the results reported in the paper, especially on XNLI. I just ran the scripts as provided, without any other changes.

Answer 1 · 2020-05-11T04:15:45.000Z

The DEV results look normal, but something may be wrong with the TEST results.

Answer 2 · 2020-05-12T18:06:22.000Z

Hi,

Can you post the numbers you're obtaining? We'd like to know the magnitude of the difference.

Also, your command lines would help us debug.

Answer 3 · 2020-05-13T21:33:31.000Z

@944721805 I believe you observed test scores close to 0.33. This is because we replaced the true labels in all the test sets with a placeholder label. The rationale is to prevent people from simply submitting modified ground-truth labels to our platform. Instead, we encourage participants to submit their predictions to our platform for evaluation.
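
To illustrate why the score lands near 0.33, here is a minimal sketch, not the actual evaluation code. It assumes XNLI's three classes, a model whose predictions are spread roughly evenly across them, and a hypothetical placeholder value (the real placeholder is not stated in this thread). Scoring such predictions against a constant label gives about one third:

```python
import random

# The three XNLI classes.
LABELS = ["entailment", "neutral", "contradiction"]

# Hypothetical placeholder; the value actually used in the released test files
# is an assumption here.
PLACEHOLDER = "contradiction"


def accuracy(predictions, gold):
    """Return the fraction of positions where the prediction equals the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


rng = random.Random(0)

# Stand-in for model output: predictions spread evenly over the three classes.
predictions = [rng.choice(LABELS) for _ in range(30000)]

# Every test example carries the same placeholder instead of its true label.
gold = [PLACEHOLDER] * len(predictions)

test_accuracy = accuracy(predictions, gold)
print(f"accuracy against placeholder labels: {test_accuracy:.3f}")
```

Under these assumptions the local test number says nothing about model quality, so only the DEV score and the platform's evaluation are meaningful.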