sheffieldnlp/naacl2018-fever

Error Analysis


j6mes commented
  • how often did DR (document retrieval) return the right page?
  • how often did SR (sentence retrieval) return the right page?
  • how often did SR return the original evidence?
  • for the cases where SR returned different evidence: how do the BLEU/ROUGE similarities between the claim and the returned evidence compare with those between the claim and the gold evidence? (see the sketch after this list)
  • Error coding scheme
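
A rough sketch of the BLEU half of the last comparison, using NLTK's sentence-level BLEU (this is not code from this repo; the claim/evidence strings are made-up placeholders, and ROUGE would need a separate package):

```python
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def claim_evidence_bleu(claim, evidence):
    """BLEU of an evidence sentence, with the claim as the sole reference."""
    reference = [word_tokenize(claim.lower())]
    hypothesis = word_tokenize(evidence.lower())
    # Smoothing stops short sentences from scoring zero when a higher-order
    # n-gram has no overlap with the reference.
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

claim = "The Eiffel Tower is located in Paris."                # placeholder
returned = "The tower stands on the Champ de Mars."            # SR output (placeholder)
gold = "The Eiffel Tower is a wrought-iron tower in Paris."    # gold evidence (placeholder)

print("claim vs returned:", claim_evidence_bleu(claim, returned))
print("claim vs gold:    ", claim_evidence_bleu(claim, gold))
```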
j6mes commented
| Metric | NLTK | DrQA Sents (Precomputed IDF) | DrQA Sents (New IDF) |
| --- | --- | --- | --- |
| Runtime | 2 hours | 10 hours | 12 hours |
| Strict Accuracy (strict; requires correct evidence) | 0.2476 | 0.1827 | 0.2698 |
| Classification Accuracy (no evidence requirement) | 0.4885 | 0.4588 | 0.4922 |
| Correct Document Return Rate (dmatch) | 0.5793 | 0.5893 | 0.5893 |
| Correct Document Return Rate after Sentence Selection (smatch) | 0.4773 | 0.2690 | 0.5596 |
| Correct Text Return Rate (Supports/Refutes only) | 0.3647 | 0.1083 | 0.4680 |
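
For reference, a minimal sketch of how these metrics relate (hypothetical field names and a simplified evidence check, not the repo's actual scorer): strict accuracy demands both the right label and the gold evidence, so it is bounded above by smatch, which sentence selection in turn bounds by dmatch.

```python
# Hypothetical sketch of the metrics in the table above. Field names and the
# evidence check are assumptions, not the repo's actual scorer.
def score(examples):
    acc = strict = dmatch = smatch = 0
    scored = 0  # Supports/Refutes examples, where evidence is required
    for ex in examples:
        label_ok = ex["predicted_label"] == ex["gold_label"]
        acc += label_ok
        if ex["gold_label"] == "NOT ENOUGH INFO":
            strict += label_ok  # no evidence requirement for NEI claims
            continue
        scored += 1
        gold_pages = {page for page, _ in ex["gold_evidence"]}
        # dmatch: the document retriever found a gold page
        dmatch += bool(gold_pages & set(ex["retrieved_pages"]))
        # smatch: a gold page is still present after sentence selection
        smatch += bool(gold_pages & {p for p, _ in ex["selected_evidence"]})
        # strict: correct label AND the gold (page, sentence) pairs selected
        evidence_ok = set(ex["gold_evidence"]) <= set(ex["selected_evidence"])
        strict += label_ok and evidence_ok
    n = len(examples)  # assumes n > 0 and at least one Supports/Refutes claim
    return {"accuracy": acc / n, "strict": strict / n,
            "dmatch": dmatch / scored, "smatch": smatch / scored}
```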
j6mes commented

@andreasvlachos using DrQA instead of NLTK for sentence selection gives us a roughly 2-point boost in strict accuracy, at the cost of an extra 10 hours of runtime. The dmatch and smatch figures give upper bounds on strict accuracy (considering the Supports/Refutes classes): with DrQA the correct document is still in the evidence after sentence selection 55% of the time, whereas with NLTK this is only 47%.
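
To make the bound concrete with the new-IDF DrQA numbers above: strict = 0.2698 ≤ smatch = 0.5596 ≤ dmatch = 0.5893. A strictly correct prediction needs the right page to survive both document retrieval and sentence selection, so strict accuracy can never exceed smatch, which in turn cannot exceed dmatch.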