Incorrect classification of definite and indefinite articles in German
Closed this issue · 3 comments
Thank you very much for sharing Wapiti. This is really awesome.
Before I describe a potential issue with the German model (my hypothesis) that can be downloaded from your homepage, let me say that I have little experience with NLP or linguistics. Thus, I might be completely wrong.
Having played around with Wapiti (and now also RFTagger), it seems Wapiti consistently swaps the classification of definite and indefinite articles. For example:
Der ART.Indef.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
die ART.Indef.Nom.Sg.Fem*
Schwester N.Reg.Nom.Sg.Fem
des ART.Indef.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
Ein ART.Def.Nom.Sg.Masc*
Mann N.Reg.Nom.Sg.Masc
heiratet VFIN.Full.3.Sg.Pres.Ind
eine ART.Def.Nom.Sg.Fem*
Frau N.Reg.Nom.Sg.Fem
eines ART.Def.Gen.Sg.Masc*
Freundes N.Reg.Gen.Sg.Masc
. SYM.Pun.Sent
In German, “der”, “die”, and “des” are definite articles, while “ein”, “eine”, and “eines” are indefinite.
The RFTagger homepage contains the following sample output of RFTagger (which appears to use the same tagset):
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Indef.Nom.Sg.Masc
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
Having Wapiti tag the same sentence (using the German model I downloaded from your website) yields the following output:
Das PRO.Dem.Subst.-3.Nom.Sg.Neut
ist VFIN.Sein.3.Sg.Pres.Ind
ein ART.Def.Nom.Sg.Masc*
Testsatz N.Reg.Nom.Sg.Masc
. SYM.Pun.Sent
While RFTagger correctly classifies “ein” as an indefinite article, Wapiti classifies it as a definite one.
Am I right in assuming this is an issue with the German model? What would be the best way to correct this?
@ThomasBarnekow Every tagger (e.g., Wapiti, RFTagger) has a level of quality that depends heavily on the trained model. In any case, if you can tag a large test set and then compare accuracy, that will tell you which tagger/model suits you best.
PS: I found that accuracy also depends on the implementation itself. I trained CRF++, CRFSuite, and Wapiti on the same data but ended up with different results :)
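To make that comparison concrete, a per-token accuracy check between a tagger's output and a gold-standard file could look like this. This is only a minimal sketch: it assumes one `word<TAB>tag` pair per line and blank lines between sentences (the format shown in the examples above); real tagger output may differ.

```python
def read_tags(lines):
    """Extract the tag column from word<TAB>tag lines, skipping blank lines."""
    return [line.split("\t")[1] for line in lines if line.strip()]

def accuracy(predicted_lines, gold_lines):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    pred = read_tags(predicted_lines)
    gold = read_tags(gold_lines)
    if len(pred) != len(gold):
        raise ValueError("token counts differ: %d vs %d" % (len(pred), len(gold)))
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)

# Example: two of three tags agree, so accuracy is 2/3.
pred = ["Das\tPRO.Dem", "ist\tVFIN", "ein\tART.Def"]
gold = ["Das\tPRO.Dem", "ist\tVFIN", "ein\tART.Indef"]
print(accuracy(pred, gold))
```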
@almasaud Thanks! I understand that there are differences between the trained models and the implementations. In this case, my assumption is that the training data might be wrong (e.g., because the tags for definite and indefinite articles might be swapped).
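If the Def/Indef labels really are swapped consistently in the model, one stop-gap workaround would be to swap them back in the tagger's output as a post-processing step until the model is retrained. A sketch, again assuming the `word<TAB>tag` format shown above:

```python
def swap_article_tags(line):
    """Swap ART.Def and ART.Indef in a word<TAB>tag line (stop-gap fix)."""
    if "\t" not in line:
        return line  # blank separator line between sentences
    word, tag = line.split("\t", 1)
    if tag.startswith("ART.Def"):
        tag = tag.replace("ART.Def", "ART.Indef", 1)
    elif tag.startswith("ART.Indef"):
        tag = tag.replace("ART.Indef", "ART.Def", 1)
    return word + "\t" + tag

print(swap_article_tags("ein\tART.Def.Nom.Sg.Masc"))
```

Non-article lines pass through unchanged, so the filter can be applied to a whole output file line by line.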
I'm just starting to play around with natural language processing. For example, I don't have any training corpora I could use to train Wapiti (or other taggers). Could you point me to something I could use for further testing?
@ThomasBarnekow That could be the case. However, there is still a chance that the model is just confused in this particular sentence.
Anyhow, I just googled and found this dataset: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.en.html. I have never tried it before, but it looks like a good candidate.