Reproducibility question
944721805 opened this issue · 3 comments
944721805 commented
Hi, I found that I can't reproduce the results reported in the paper, especially on XNLI. I just ran the scripts with bash, without any other changes.
944721805 commented
The DEV results look normal, but something may be wrong with the TEST results.
melvinjosej commented
Hi,
Can you post the numbers you're obtaining? We'd like to know the magnitude of the difference.
Your command lines would also help us debug.
JunjieHu commented
@944721805 I believe you observed test scores close to 0.33. This is because we replaced the true labels in all the test sets with a placeholder label. The rationale is to prevent people from simply submitting modified ground-truth labels as their predictions. Instead, we encourage participants to submit their results to our platform for evaluation.
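To illustrate why a placeholder label gives a score near 0.33 on a three-class task such as XNLI, here is a minimal sketch. It is not the repository's evaluation code: the helper names `accuracy` and `mask_labels`, the synthetic data, and the choice of `"contradiction"` as the placeholder are all assumptions made for the example. It also assumes the model's predictions are spread roughly evenly across the three classes.

```python
# Hypothetical illustration; names and data are not taken from the repository.
LABELS = ["entailment", "neutral", "contradiction"]


def accuracy(predictions, gold):
    """Fraction of positions where the prediction equals the gold label."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def mask_labels(gold, placeholder):
    """Replace every gold label with the same placeholder label."""
    return [placeholder] * len(gold)


# Synthetic, perfectly balanced gold labels (1000 examples per class).
true_gold = [LABELS[i % len(LABELS)] for i in range(3000)]

# Pretend the model is perfect, so any drop in score comes from the masking.
predictions = list(true_gold)

# Assumed placeholder; which label was actually used is not stated in this thread.
masked_gold = mask_labels(true_gold, "contradiction")

dev_like_score = accuracy(predictions, true_gold)     # scored against real labels
test_like_score = accuracy(predictions, masked_gold)  # scored against placeholders

print(f"real labels:        {dev_like_score:.3f}")
print(f"placeholder labels: {test_like_score:.3f}")
```

Under these assumptions, even a perfect model matches the placeholder on only about one third of the examples. That is why a local test score near 0.33 says nothing about model quality.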