flairNLP/flair

Reproducing results on German CoNLL03 NER task

wangxinyu0922 opened this issue · 13 comments

Hi, I ran the template in
EXPERIMENTS.md
for NER German, but I cannot reproduce the German NER result reported in "Contextual String Embeddings for Sequence Labeling", which is an 88.32 F1-score. In my experiment, the result is 83.22. What could be causing this? I found that the results posted in #713 are similar to mine.
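For reference, here is a minimal sketch of the kind of training setup the EXPERIMENTS.md template describes (flair API as of ~0.4; the embedding choices and hyperparameters below are assumptions and may differ from the actual template):

from flair.datasets import CONLL_03_GERMAN
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load the German CoNLL-03 corpus; the data itself is not distributed with
# flair and must be generated from the ECI Multilingual Text CD beforehand
corpus = CONLL_03_GERMAN(base_path='resources/tasks')
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# stack classic German word embeddings with contextual string embeddings
embeddings = StackedEmbeddings([
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward'),
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ner-german', max_epochs=150)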

I have a similar problem reproducing the German NER results... Could someone help us out?

Can confirm this (83.78% on test set) 🤔

Ok, I have to check my dataset. Every time I've run the code I get the results as reported, but from #713 it seems something may have gone wrong when I produced it from the ECI Multilingual Text CD following these instructions.

Unfortunately, I can already confirm that my dataset statistics are off and that many MISC annotations in particular are missing. I have to dig deeper into how this happened - I'll also produce the dataset again and rerun the experiment.

@alanakbik I could send you my pre-processed files. Unfortunately, I'm currently not able to pre-process them again; using the make.deu script, it always returns:

Use of uninitialized value $chunk in concatenation (.) or string at ../bin/revealLemmas line 16, <STDIN> line 330100.
alignment problem in data files

even when using an old Perl version...

Update: I could successfully re-create the dataset 😅

Just set your console encoding to ISO Latin-1 (or better: use a Docker container for that).
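A note on what that fixes: the files on the CD are Latin-1 encoded, so any step that reads them as UTF-8 can silently corrupt umlauts and trigger the alignment error. A minimal sketch of re-encoding a generated file explicitly (the file name deu.train is an assumption):

# read the CoNLL file as Latin-1 and write it back out as UTF-8
with open('deu.train', encoding='latin-1') as src, \
        open('deu.train.utf8', 'w', encoding='utf-8') as dst:
    dst.write(src.read())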

I'm still having difficulties running the make.deu script, but I think in trying to debug this I've almost solved the mystery of the different datasets.

Namely: are you using the tags from etc/tags.deu or etc.2006/tags.deu? There are in fact two tags.deu files provided in the tar file you can download here: the original one and a 2006 revision.

In the revision folder, the file etc.2006/revision.txt provides an explanation of the differences. An excerpt:

2. Major Changes
----------------
By far the most changes were made in the named entity class MISC:
 - adjectives derived from names ("deutsch") are no longer marked
 - nouns derived from names ("Frankfurter") are no longer marked
 - compounds which contain a name, but are not a name in itself
   ("SPD-Vorsitzender") are no longer marked

The rest of the changes were mainly simple corrections or changes to conserve
consistency.

I've collected statistics of both tags.deu files and they are as follows:

Original version: LOC: 6516, MISC: 3715, ORG: 4367, PER: 5316

2006 revision: LOC: 6519, MISC: 1182, ORG: 3818, PER: 5406

The statistics of my corpus are nearly (but not exactly) the same as the 2006 revision. When I created the dataset, I probably chose the newest tag set. This still does not explain everything but I think we're getting closer.
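If you want to check which revision your own corpus matches, here is a small sketch that tallies entity types in a generated CoNLL file (the file name is an assumption; the NER tag is taken to be the fifth whitespace-separated column):

from collections import Counter

counts = Counter()
with open('deu.train', encoding='latin-1') as f:
    for line in f:
        parts = line.split()
        # count B-/I- tags such as B-LOC or I-PER by their entity type
        if len(parts) >= 5 and '-' in parts[4]:
            counts[parts[4].split('-', 1)[1]] += 1

print(dict(counts))  # compare against the statistics above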

Oh, I was using the old tags from the 2003 (?) dataset -> I can confirm the massive differences for MISC tags:

2006 tags:

$ cat deu.t* | cut -d " " -f 5 | cut -d "-" -f 2 | sort | uniq -c
   7586 LOC
   1876 MISC
   6766 ORG
   8367 PER

2003 tags:

$ cat 2003/deu.t* | cut -d " " -f 5 | cut -d "-" -f 2 | sort | uniq -c
   7864 LOC
   4748 MISC
   7621 ORG
   8309 PER

Ok, I've just re-generated both variants of the CoNLL-03 corpus for German with the script and the CD, i.e. the original and the 06 revision. My corpus exactly matches the 06 revision, meaning that I used the updated tags when creating the dataset and also in my experiments.

I guess the big question now is what numbers to report. It seems that recent works are reporting over the original dataset, so to be comparable we'd probably have to report on this as well. However, as indicated in the dataset README, the original dataset had a lot of inconsistent annotation, so numbers reported over this data will likely not be meaningful, or at least not as meaningful as numbers reported over the cleaner 06 revision. Any preferences?

Hi all, thank you for your contributions.

Oh, I didn't know of the 06 revision. I actually raised the same question in #713 , but it's clear now thanks to your discussion here.

In my opinion, the 06 revision should be used in this situation, as @alanakbik says. The problem is that many researchers probably don't know about the 06 revision, just as I didn't, and quite a few papers/studies likely used the original one. So we should spread awareness of the 06 revision. For example, when we write a paper or document, we should state explicitly that we use it. This is my opinion.

Thanks.

Hi, @alanakbik

I agree we should start using the 06 revised version, but could you point out in the README that you are using the 06 version? I spent a long time trying to incorporate your settings (embeddings etc.) into my own system but couldn't reach the same level of results :)

So if you could add a note to the README, people would save a lot of time and could more easily make fair comparisons with your system. Ideally, you could also provide results on the original corpus. I will report results on both versions in my own paper so that people get to know the 06 version and perhaps start switching to it as well.

Best,

Juntao

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Added the info, and will point this out in future papers!