ZihanWangKi/CrossWeigh

Number of differences between CoNLL++ and CoNLL2003 test sets

alexeyev opened this issue · 3 comments

Dear colleagues, thank you for your work and for releasing the code and the data!

I have converted both original and corrected CoNLL2003 test sets into labeled one-line sequences; this resulted in 168 'diffs', while the README states there should be 186 sentences that differ from the original test set. Can this be a misprint?

Thank you.

Hi,

I checked the conllpp_test.txt in the data/ folder and the conll dataset from https://huggingface.co/datasets/conll2003/blob/main/conll2003.py.
I wrote something like

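# Count the sentences whose tags differ between CoNLL++ and the original test set.
with open("conllpp_test.txt", "r") as f:
    a = f.readlines()
with open("test.txt", "r") as f:
    b = f.readlines()
total = 0
cur = 0  # becomes 1 once the current sentence has a differing tag
for _a, _b in zip(a, b):
    # The NER tag is the last space-separated field on each line.
    _a = _a.strip().split(" ")[-1]
    _b = _b.strip().split(" ")[-1]
    if len(_a) == 0:
        # A blank line marks a sentence boundary.
        total += cur
        cur = 0
    elif _a != _b:
        cur = 1
# Count the last sentence if the file does not end with a blank line.
if cur:
    total += cur
print(total)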

and the output is 186. Do you spot anything that differs from what you tried?
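
To see which sentences differ rather than only how many, the counting loop above could be adapted along these lines. This is only a minimal sketch: the function name `differing_sentences` is hypothetical, and it assumes, like the script, that both files are line-aligned and that the tag is the last space-separated field on each line.

```python
def differing_sentences(lines_a, lines_b):
    """Return 0-based indices of blank-line-separated sentences whose tags differ.

    Hypothetical helper; assumes both inputs are aligned line by line.
    """
    diffs = []
    index = 0        # index of the sentence currently being read
    changed = False  # True once a differing tag is seen in this sentence
    for line_a, line_b in zip(lines_a, lines_b):
        tag_a = line_a.strip().split(" ")[-1]
        tag_b = line_b.strip().split(" ")[-1]
        if not tag_a:
            # Blank line: close the current sentence.
            if changed:
                diffs.append(index)
            index += 1
            changed = False
        elif tag_a != tag_b:
            changed = True
    # Handle a final sentence that is not followed by a blank line.
    if changed:
        diffs.append(index)
    return diffs


# Tiny in-memory example (made-up lines, not taken from the real files).
sample_a = ["EU B-ORG\n", "rejects O\n", "\n", "Leeds B-ORG\n", "won O\n"]
sample_b = ["EU B-ORG\n", "rejects O\n", "\n", "Leeds B-LOC\n", "won O\n"]
print(differing_sentences(sample_a, sample_b))  # [1]
```

Comparing the returned indices with the sentences flagged by another method should show exactly where the two counts diverge.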

Thank you for such a swift response -- and sorry for bothering you. Running your script against the original test.txt and conllpp_test.txt does output 186.

I convert the original markup into sentences that look like this:
[ Leeds ]ORG had already fined [ Bowyer ]PER 4,000 pounds ( $ 6,600 ) and warned him a repeat of his criminal behaviour could cost him his place in the side .

So it seems that some of the tagging errors in test.txt you have corrected to build CoNLL++ simply don't exist after the conversion. Or, maybe, I have a bug in the conversion script :) Will check. Thanks again.
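
For illustration, here is a rough sketch of how a conversion of this kind could hide a tag-level difference. It is not the actual conversion script: `read_conll` and `to_bracketed` are hypothetical helpers, and the sketch assumes space-separated lines with the token first and an `O`, `B-TYPE` or `I-TYPE` tag last. In this sketch an entity that starts with `I-ORG` renders exactly like one that starts with `B-ORG`, so a correction that only changes the prefix would disappear after conversion. Whether that explains the 168 vs. 186 gap here is an open question.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs; blank lines end a sentence."""
    sentence = []
    for line in lines:
        fields = line.strip().split(" ")
        if fields == [""]:
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence


def to_bracketed(sentence):
    """Render a sentence as '[ Leeds ]ORG had ...' (hypothetical format helper)."""
    tokens = []
    open_type = None  # entity type of the currently open bracket, if any
    for token, tag in sentence:
        if tag == "O":
            entity_type, starts_new = None, False
        else:
            prefix, entity_type = tag.split("-", 1)
            # A new entity starts on B- or when the entity type changes.
            starts_new = prefix == "B" or entity_type != open_type
        if open_type is not None and (entity_type is None or starts_new):
            tokens.append("]" + open_type)
            open_type = None
        if entity_type is not None and open_type is None:
            tokens.append("[")
            open_type = entity_type
        tokens.append(token)
    if open_type is not None:
        tokens.append("]" + open_type)
    return " ".join(tokens)


# Made-up example: the two inputs differ only in the B-/I- prefix.
with_b = ["Leeds B-ORG\n", "had O\n", "\n"]
with_i = ["Leeds I-ORG\n", "had O\n", "\n"]
converted_b = [to_bracketed(s) for s in read_conll(with_b)]
converted_i = [to_bracketed(s) for s in read_conll(with_i)]
print(converted_b[0])  # [ Leeds ]ORG had
print(sum(x != y for x, y in zip(converted_b, converted_i)))  # 0
```

A line-by-line tag comparison would count this sentence as changed, while a comparison of the bracketed strings would not.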

How did you prepare the dataset, and where did you annotate it for custom data training?