The number of examples in the out-of-domain dev set is not accurate

Question

maxsonate opened this issue 5 years ago · 1 comment

For example:
DROP: 1557 vs 1,503
DuoRC.ParaphraseRC: 1648 vs 1,501

Answer 1 · 2020-01-13T18:21:44.000Z

Hi,

How are you counting?

If you count using this script, you should get the right numbers:

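from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import json
import gzip

# Usage: python <this script> <path to .jsonl.gz file>
examples = 0
fname = sys.argv[1]

with gzip.open(fname, 'rb') as f:
    for i, line in enumerate(f):
        # Each line is one JSON object.
        obj = json.loads(line)

        # The first line may be a header rather than data; skip it.
        if i == 0 and 'header' in obj:
            continue

        # One example per question, so count the entries in 'qas',
        # not the number of lines.
        examples += len(obj['qas'])

print('Num examples: %d' % examples)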
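
If the numbers still differ, it may help to see how many lines and how many questions the file holds, since counting lines instead of entries in 'qas' would give a different total. The following is only a sketch, not part of the official tooling: the function name count_examples is made up for illustration, it assumes Python 3, and it assumes the same file layout the script above reads (gzipped JSON lines, an optional first line containing 'header', and a 'qas' list on every other line).

```python
import gzip
import json


def count_examples(fname):
    """Return (num_contexts, num_questions) for a gzipped JSON-lines file.

    Hypothetical helper: assumes an optional first line with a 'header'
    key and a 'qas' list on every other line.
    """
    contexts = 0
    questions = 0
    with gzip.open(fname, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.strip()
            if not line:
                # Ignore blank lines, e.g. a trailing newline at end of file.
                continue
            obj = json.loads(line)
            if i == 0 and "header" in obj:
                # The header line carries no examples.
                continue
            contexts += 1
            questions += len(obj["qas"])
    return contexts, questions


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        n_contexts, n_questions = count_examples(sys.argv[1])
        print("Num contexts: %d" % n_contexts)
        print("Num examples: %d" % n_questions)
```

The second number is computed the same way as in the script above; the first is only there for comparison.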