Incorrect classification of definite and indefinite articles in German
Closed this issue · 3 comments
Thank you very much for sharing Wapiti. This is really awesome.
Before I describe a potential issue with the German model (my hypothesis) that can be downloaded from your homepage, let me say that I have little experience with NLP or linguistics. Thus, I might be completely wrong.
Having played around with Wapiti (and now also RFTagger), it seems Wapiti consistently swaps the classification of definite and indefinite articles. For example:
Der ART.Indef.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
die ART.Indef.Nom.Sg.Fem*
Schwester N.Reg.Nom.Sg.Fem
des ART.Indef.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
Ein ART.Def.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
eine ART.Def.Nom.Sg.Fem*
Frau N.Reg.Nom.Sg.Fem
eines ART.Def.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
In German, “der”, “die”, and “des” are definite articles, while “ein”, “eine”, and “eines” are indefinite.
The RFTagger homepage contains the following sample output of RFTagger (which appears to use the same tagset):
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Indef.Nom.Sg.Masc
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
Having Wapiti tag the same sentence (using the German model I downloaded from your website) yields the following output:
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Def.Nom.Sg.Masc*
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
While RFTagger correctly classifies “ein” as an indefinite article, Wapiti classifies it as a definite one.
Am I right in assuming this is an issue with the German model? What would be the best way to correct this?
@ThomasBarnekow Every tagger (e.g., Wapiti, RFTagger) has a level of quality that depends heavily on the trained model. In any case, if you can tag a large test set and then compare accuracy, that will tell you which tagger/model suits you best.
PS: I found that accuracy also depends on the implementation itself. I trained CRF++, CRFSuite, and Wapiti on the same data but ended up with different results :)
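To make that comparison concrete, a per-token accuracy check between a tagger's output and a gold-standard file could look like this. This is only a minimal sketch: it assumes one `word<TAB>tag` pair per line and blank lines between sentences (the format shown in the examples above); real tagger output may differ.

```python
def read_tags(lines):
    """Extract the tag column from word<TAB>tag lines, skipping blank lines."""
    return [line.split("\t")[1] for line in lines if line.strip()]

def accuracy(predicted_lines, gold_lines):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pred = read_tags(predicted_lines)
    gold = read_tags(gold_lines)
    if len(pred) != len(gold):
        raise ValueError("token counts differ: %d vs %d" % (len(pred), len(gold)))
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)

# Example: two of three tags agree, so accuracy is 2/3.
pred = ["Das\tPRO.Dem", "ist\tVFIN", "ein\tART.Def"]
gold = ["Das\tPRO.Dem", "ist\tVFIN", "ein\tART.Indef"]
print(accuracy(pred, gold))
```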
@almasaud Thanks! I understand that there are differences between the trained models and the implementations. In this case, my assumption is that the training data might be wrong (e.g., because the tags for definite and indefinite articles might be swapped).
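If the Def/Indef labels really are swapped consistently in the model, one stop-gap workaround would be to swap them back in the tagger's output as a post-processing step until the model is retrained. A sketch, again assuming the `word<TAB>tag` format shown above:

```python
def swap_article_tags(line):
    """Swap ART.Def and ART.Indef in a word<TAB>tag line (stop-gap fix)."""
    if "\t" not in line:
        return line  # blank separator line between sentences
    word, tag = line.split("\t", 1)
    if tag.startswith("ART.Def"):
        tag = tag.replace("ART.Def", "ART.Indef", 1)
    elif tag.startswith("ART.Indef"):
        tag = tag.replace("ART.Indef", "ART.Def", 1)
    return word + "\t" + tag

print(swap_article_tags("ein\tART.Def.Nom.Sg.Masc"))
```

Non-article lines pass through unchanged, so the filter can be applied to a whole output file line by line.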
I'm just starting to play around with natural language processing. For example, I don't have any training corpora I could use to train Wapiti (or other taggers). Could you point me to something I could use for further testing?
@ThomasBarnekow That could be the case. However, there is still a chance that the model is just confused in this particular sentence.
Anyhow, I just googled and found this dataset: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html. I have never tried it before, but it looks like a good candidate.