google-research/xtreme

Reproducibility question

944721805 opened this issue · 3 comments

Hi, I found that I can't reproduce the results reported in the paper, especially on XNLI. I just ran the scripts without any other changes.

The dev results look normal, but something may be wrong with the test results.

Hi,

Can you post the numbers you're obtaining? We'd like to know the magnitude of the difference.

Your command lines would also help us debug.

@944721805 I believe you observed test scores close to 0.33. This is because we replaced the true labels in all the test sets with a placeholder label. The rationale is to prevent people from simply submitting modified ground-truth labels to our platform. Instead, we encourage participants to submit their results to our platform for evaluation.
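
For anyone hitting the same thing, here is a minimal sketch of why a placeholder label gives roughly 0.33 on a three-class task such as XNLI. It is not taken from the repository's evaluation code. The helper names (`accuracy`, `looks_like_placeholder`), the choice of `"contradiction"` as the placeholder, and the perfectly balanced predictions are all assumptions made for illustration.

```python
# Illustration only: the helpers and the placeholder value are hypothetical,
# not the ones used by the XTREME scripts.
LABELS = ["contradiction", "entailment", "neutral"]  # the three NLI classes


def accuracy(predictions, gold):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def looks_like_placeholder(gold):
    """True when every gold label is identical, a hint the labels are hidden."""
    return len(set(gold)) == 1


# Assumption: the model spreads its predictions evenly over the three classes.
predictions = LABELS * 100

# Assumption: every test example carries the same placeholder label.
placeholder_gold = ["contradiction"] * len(predictions)

score = accuracy(predictions, placeholder_gold)
print(f"accuracy against placeholder labels: {score:.2f}")
print(f"gold labels look like placeholders: {looks_like_placeholder(placeholder_gold)}")
```

With a single constant gold label, accuracy only measures how often the model happens to predict that label, so a score near one third on the test split says nothing about model quality. Checking whether the gold column contains a single distinct value is a quick way to tell this case apart from a real regression.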