Reproducing results and metrics questions
carlos-gemmell opened this issue · 0 comments
Dear authors,
Thank you for your contribution of this exciting new dataset.
I have run your scripts and obtained the following scores on train and dev.
Doc retrieval on train:
{'hit5_900': 33.900170601507895, 'hit8_900': 35.00632876561554, 'hit10_900': 35.31451213472016, 'exact_900': 18.254361345000277, 'f1_900': 58.984328870510964, 'total_900': 18171}
Doc retrieval on dev:
{'hit5_900': 23.325, 'hit8_900': 25.025, 'hit10_900': 25.575, 'exact_900': 9.325, 'f1_900': 52.60715219421167, 'total_900': 4000}
I trained the sentence extraction model; the training script outputs:
{'exact_100': 0.35, 'f1_100': 9.88619047619044, 'total_100': 4000}
Sentence extraction on the dev set:
{'exact_1900': 3.7, 'f1_1900': 41.555490342990716, 'total_1900': 4000}
On train:
{'exact_1900': 11.53486324362996, 'f1_1900': 52.726213672225306, 'total_1900': 18171}
Results from training the claim verification model:
{'acc_100': 52.425, 'total_100': 4000}
On dev:
{'acc_2000': 65.125, 'total_2000': 4000}
I have several questions.
- Can you confirm whether these scores align with your results on dev? I see a few discrepancies relative to the scores reported in the paper.
- What does the `*_900` suffix in these scores mean, and why does it later change to `*_1900`?
- Could you detail with an example how exact match is used here? Does it mean that the sentence extraction model must return exactly the correct sentences? For document retrieval, EM is not a common metric; could you outline how you calculate it with an example? (The first sketch after this list shows my current assumption.) Could you also indicate the retrieval depth at which F1 is computed, and break F1 down into precision and recall?
- Why is the HoVer score not part of the evaluation measures? Could it be included? (The second sketch below shows my understanding of this metric.)
- Similar to the evaluation you provide for the test set on your website, could a minimal standalone script be provided to quickly get sentence, paragraph, and claim scores on dev without running `run_hover.py`, which requires many dependencies? (The third sketch below is roughly what I have in mind.)
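To make the EM/F1 question concrete, here is a minimal sketch of how I currently assume document-level EM and F1 are computed: a set-level comparison between the top-k retrieved titles and the gold titles. The function names and the cutoff `k` are my own assumptions, not taken from `run_hover.py`.

```python
# Hypothetical set-level EM and F1 over retrieved document titles.
# `retrieved` is a ranked list of titles, `gold` the annotated titles;
# the cutoff k and both function names are assumptions on my part.

def doc_em(retrieved, gold, k=5):
    """EM = 1 only if every gold document appears in the top-k retrieved."""
    return float(set(gold).issubset(set(retrieved[:k])))

def doc_f1(retrieved, gold, k=5):
    """Set precision/recall/F1 between top-k retrieved and gold titles."""
    pred, gold = set(retrieved[:k]), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

If the actual computation differs (for example, if F1 uses a different depth than EM), a worked example in the README would clear this up.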
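Regarding the HoVer score, my reading of the paper is that it is a strict metric in the spirit of the FEVER score: a claim counts only if the predicted label is correct and all gold supporting sentences are retrieved. A minimal sketch under that assumption follows; the `label`/`evidence` field names and schema are my guess, not the repository's actual format.

```python
def hover_score(predictions, references):
    """Percentage of claims with a correct label AND complete gold evidence.

    Both arguments are dicts keyed by claim id; 'label' is the verification
    label and 'evidence' a list of [title, sentence_id] pairs. This schema
    is hypothetical.
    """
    correct = 0
    for uid, ref in references.items():
        pred = predictions[uid]
        label_ok = pred["label"] == ref["label"]
        gold_ev = {tuple(e) for e in ref["evidence"]}
        pred_ev = {tuple(e) for e in pred["evidence"]}
        if label_ok and gold_ev.issubset(pred_ev):
            correct += 1
    return 100.0 * correct / len(references)
```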
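On the standalone script, something as small as the following would suffice. This is only a sketch assuming the gold and prediction files are JSON lists of examples with `uid`, `label`, and `supporting_facts` fields; that layout is my guess, not the repo's format.

```python
import json
import sys

# Hypothetical standalone evaluator: reads a gold file and a prediction
# file and prints label accuracy and sentence-level EM without importing
# anything from run_hover.py. The JSON schema assumed here is a guess.

def load(path):
    """Load a JSON list of examples into a dict keyed by 'uid'."""
    with open(path) as f:
        return {ex["uid"]: ex for ex in json.load(f)}

def main(gold_path, pred_path):
    gold, pred = load(gold_path), load(pred_path)
    n = len(gold)
    label_acc = 100.0 * sum(
        pred[u]["label"] == gold[u]["label"] for u in gold
    ) / n
    sent_em = 100.0 * sum(
        {tuple(sf) for sf in pred[u]["supporting_facts"]}
        == {tuple(sf) for sf in gold[u]["supporting_facts"]}
        for u in gold
    ) / n
    print({"label_acc": label_acc, "sentence_em": sent_em})

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```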
Resolving these questions would make it easier to cite your work in future research.
Best regards.