sheffieldnlp/naacl2018-fever

Error Analysis


j6mes commented
  • how often did DR (document retrieval) return the right page?
  • how often did SR (sentence retrieval) return the right page?
  • how often did SR return the original evidence?
  • for the cases where SR returned different evidence: how do the BLEU/ROUGE similarities between the claim and the returned evidence compare with those between the claim and the gold evidence? (see the sketch after this list)
  • Error coding scheme
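
A rough sketch of the BLEU half of the last comparison, using NLTK's sentence-level BLEU (this is not code from this repo; the claim/evidence strings are made-up placeholders, and ROUGE would need a separate package):

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def claim_evidence_bleu(claim, evidence):
    """BLEU of an evidence sentence, with the claim as the sole reference."""
    reference = [word_tokenize(claim.lower())]
    hypothesis = word_tokenize(evidence.lower())
    # Smoothing stops short sentences from scoring zero when a higher-order
    # n-gram has no overlap with the reference.
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

claim = "The Eiffel Tower is located in Paris."                # placeholder
returned = "The tower stands on the Champ de Mars."            # SR output (placeholder)
gold = "The Eiffel Tower is a wrought-iron tower in Paris."    # gold evidence (placeholder)

print("claim vs returned:", claim_evidence_bleu(claim, returned))
print("claim vs gold:    ", claim_evidence_bleu(claim, gold))
```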
j6mes commented
| Metric | NLTK | DrQA Sents (Precomputed IDF) | DrQA Sents (New IDF) |
| --- | --- | --- | --- |
| Runtime | 2 hours | 10 hours | 12 hours |
| Strict Accuracy (strict; requires correct evidence) | 0.2476 | 0.1827 | 0.2698 |
| Classification Accuracy (no evidence requirement) | 0.4885 | 0.4588 | 0.4922 |
| Correct Document Return Rate (dmatch) | 0.5793 | 0.5893 | 0.5893 |
| Correct Document Return Rate after Sentence Selection (smatch) | 0.4773 | 0.2690 | 0.5596 |
| Correct Text Return Rate (Supports/Refutes only) | 0.3647 | 0.1083 | 0.4680 |
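
For reference, a minimal sketch of how these metrics relate (hypothetical field names and a simplified evidence check, not the repo's actual scorer): strict accuracy demands both the right label and the gold evidence, so it is bounded above by smatch, which sentence selection in turn bounds by dmatch.

```python
# Hypothetical sketch of the metrics in the table above. Field names and the
# evidence check are assumptions, not the repo's actual scorer.
def score(examples):
    acc = strict = dmatch = smatch = 0
    scored = 0  # Supports/Refutes examples, where evidence is required
    for ex in examples:
        label_ok = ex["predicted_label"] == ex["gold_label"]
        acc += label_ok
        if ex["gold_label"] == "NOT ENOUGH INFO":
            strict += label_ok  # no evidence requirement for NEI claims
            continue
        scored += 1
        gold_pages = {page for page, _ in ex["gold_evidence"]}
        # dmatch: the document retriever found a gold page
        dmatch += bool(gold_pages & set(ex["retrieved_pages"]))
        # smatch: a gold page is still present after sentence selection
        smatch += bool(gold_pages & {p for p, _ in ex["selected_evidence"]})
        # strict: correct label AND the gold (page, sentence) pairs selected
        evidence_ok = set(ex["gold_evidence"]) <= set(ex["selected_evidence"])
        strict += label_ok and evidence_ok
    n = len(examples)  # assumes n > 0 and at least one Supports/Refutes claim
    return {"accuracy": acc / n, "strict": strict / n,
            "dmatch": dmatch / scored, "smatch": smatch / scored}
```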
j6mes commented

@andreasvlachos using DrQA instead of NLTK for sentence selection gives us a roughly 2-point boost in strict accuracy, at the cost of an extra 10 hours of runtime. The dmatch and smatch figures give upper bounds on strict accuracy (considering the Supports/Refutes classes): with DrQA the correct document is still in the evidence after sentence selection 55% of the time, whereas with NLTK this is only 47%.
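
To make the bound concrete with the new-IDF DrQA numbers above: strict = 0.2698 ≤ smatch = 0.5596 ≤ dmatch = 0.5893. A strictly correct prediction needs the right page to survive both document retrieval and sentence selection, so strict accuracy can never exceed smatch, which in turn cannot exceed dmatch.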