Number of differences between CoNLL++ and CoNLL2003 test sets
alexeyev opened this issue · 3 comments
Dear colleagues, thank you for your work and for releasing the code and the data!
I have converted both original and corrected CoNLL2003 test sets into labeled one-line sequences; this resulted in 168 'diffs', while the README states there should be 186 sentences that differ from the original test set. Can this be a misprint?
Thank you.
Hi,
I checked the conllpp_test.txt in the data/ folder and the conll dataset from https://huggingface.co/datasets/conll2003/blob/main/conll2003.py.
I wrote something like
a = open("conllpp_test.txt", "r").readlines()
b = open("test.txt", "r").readlines()
total = 0
cur = 0
for _a, _b in zip(a, b):
    _a = _a.strip().split(" ")[-1]
    _b = _b.strip().split(" ")[-1]
    if len(_a) == 0:
        total += cur
        cur = 0
    elif _a != _b:
        cur = 1
if cur:
    total += cur
print(total)
and the output is 186. Do you spot anything different from what you tried?
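If the two counts still disagree, a variant of the script above that reports which sentences differ, rather than only how many, may make it easier to find the ones a conversion step loses. The following is only a sketch under a few assumptions: both files are aligned line by line, the label is the last whitespace-separated column, and blank lines separate sentences. The helper names `read_label_sequences` and `differing_sentences` are hypothetical and not part of the repository.

```python
def read_label_sequences(lines):
    """Group CoNLL-style lines into one list of labels per sentence.

    Assumes the label is the last whitespace-separated column and that
    blank lines separate sentences. Marker lines such as -DOCSTART- are
    not special-cased, so they count as one-token sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split()[-1])
    if current:  # last sentence when the file has no trailing blank line
        sentences.append(current)
    return sentences


def differing_sentences(lines_a, lines_b):
    """Return the indices of sentences whose label sequences differ."""
    sents_a = read_label_sequences(lines_a)
    sents_b = read_label_sequences(lines_b)
    if len(sents_a) != len(sents_b):
        raise ValueError("inputs are not sentence-aligned")
    return [i for i, (sa, sb) in enumerate(zip(sents_a, sents_b)) if sa != sb]


# Tiny made-up example (not real CoNLL data): only the first sentence differs.
demo_a = ["Leeds NNP B-NP B-ORG\n", "won VBD B-VP O\n", "\n", "Bowyer NNP B-NP B-PER\n", "\n"]
demo_b = ["Leeds NNP B-NP B-LOC\n", "won VBD B-VP O\n", "\n", "Bowyer NNP B-NP B-PER\n", "\n"]
print(differing_sentences(demo_a, demo_b))
```

To run it on the real data, pass `open("conllpp_test.txt").readlines()` and `open("test.txt").readlines()` in place of the demo lists. The returned indices can then be checked against the sentences that a different comparison method reports as changed.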
Thank you for such a swift response -- and sorry for bothering you. Running your script against the original test.txt and conllpp_test.txt does output 186.
I convert the original markup into sentences that look like this:
[ Leeds ]ORG had already fined [ Bowyer ]PER 4,000 pounds ( $ 6,600 ) and warned him a repeat of his criminal behaviour could cost him his place in the side .
So it seems that some of the tagging errors in test.txt that you corrected to build CoNLL++ simply don't exist after the conversion. Or maybe I have a bug in the conversion script :) Will check. Thanks again.
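To illustrate how a tag-level difference can disappear in a bracketed rendering, here is a minimal sketch of such a conversion. It is not the conversion script discussed in this thread; `to_bracketed` is a hypothetical helper, and its lenient rule (an `I-` tag with no open span starts a new span) is an assumption. Whether the two test files actually differ in this way is not established here.

```python
def to_bracketed(tokens, tags):
    """Render a tagged sentence as '[ Leeds ]ORG had ...' (hypothetical helper)."""
    out = []
    span, span_type = [], None

    def flush():
        # Close the currently open entity span, if any.
        nonlocal span, span_type
        if span:
            out.append("[ " + " ".join(span) + " ]" + span_type)
        span, span_type = [], None

    for tok, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            out.append(tok)
            continue
        prefix, etype = tag.split("-", 1)
        # Lenient rule (an assumption): B- always opens a new span, and so
        # does an I- tag whose type does not match the open span.
        if prefix == "B" or etype != span_type:
            flush()
            span_type = etype
        span.append(tok)
    flush()
    return " ".join(out)


tokens = ["Leeds", "had", "fined", "Bowyer"]
tags_b = ["B-ORG", "O", "O", "B-PER"]  # entities opened with B-
tags_i = ["I-ORG", "O", "O", "I-PER"]  # same entities opened with I-

print(to_bracketed(tokens, tags_b))  # [ Leeds ]ORG had fined [ Bowyer ]PER
print(to_bracketed(tokens, tags_b) == to_bracketed(tokens, tags_i))  # True
```

The two tag sequences differ on every entity token, so a last-column comparison counts the sentence as changed, while the bracketed strings are identical. A converter with a rule like this one would therefore report fewer differing sentences than a raw tag comparison.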
How did you prepare the dataset, and where did you annotate it for custom data training?