Universal POS tags don't discern singular/plural nominals
asharkinasuit opened this issue · 5 comments
The only tags we get are NOUN, PROPN (proper noun) and PRON (pronoun). This means the config.ini entry pos_agree_mapping will not work. Fortunately, the input format also specifies a column for the universal features, which do include information on number (and other things). Could it become an option to incorporate the information offered by this column?
In UD, the morphological features are meant to be included in the features column and not in the POS tag, as you note. This is also the case for many native tagsets, for example STTS for German. If you look at the online demo of xrenner, the German model takes gender information directly from that column.
It would be easy to add another configuration, but I didn't think that would be necessary: if you have a column explicitly specifying the agreement category, what do you need the mapping for?
Maybe I should rephrase my question: does xrenner take into account information offered by the features column? If so, then obviously this is a non-issue.
Yes, it does, but it expects a single agreement category to be listed in that column, not a list of key-value pairs. However, you can use the morph_rules setting to edit the values. Here is an example from the German model, which takes RFTagger morphological analyses and reduces them to one of Fem/Masc/Neut/Pl or 2nd/1st person agreement:
# Edit morphology information - cascade of string replace rules to use on the morph field in conll data if available
morph_rules=.*([12]).*(Sg|Pl).*/\1\2;([12])Sg/\1;^[^0-9].*(Pl).*/\1;^[^0-9].*(Fem|Masc|Neut).*/\1;.*\.\*$/_
This changes things like:
- N.Reg.Nom.Sg.Neut
- PRO.Pers.Subst.3.Nom.Pl.*
- N.Reg.Dat.Sg.Fem
To:
- Neut
- Pl
- Fem
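To make the cascade concrete, here is a minimal Python sketch that applies the rule string above to the three example analyses. It assumes each rule has the form pattern/replacement, that rules are separated by semicolons, and that they are applied in order as regular-expression substitutions. The helper names parse_rules and apply_rules are made up for this illustration and are not xrenner's own code.

```python
import re

# Rule string copied verbatim from the German model's morph_rules setting.
MORPH_RULES = (
    r".*([12]).*(Sg|Pl).*/\1\2;([12])Sg/\1;^[^0-9].*(Pl).*/\1;"
    r"^[^0-9].*(Fem|Masc|Neut).*/\1;.*\.\*$/_"
)


def parse_rules(rule_string):
    """Split 'pattern/replacement;...' into (compiled pattern, replacement) pairs.

    Assumption: ';' separates rules and the first '/' separates pattern
    from replacement. This mirrors the config syntax, not xrenner's parser.
    """
    rules = []
    for rule in rule_string.split(";"):
        pattern, replacement = rule.split("/", 1)
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_rules(morph, rules):
    """Run every substitution in order; each rule sees the previous rule's output."""
    for pattern, replacement in rules:
        morph = pattern.sub(replacement, morph)
    return morph


GERMAN_RULES = parse_rules(MORPH_RULES)

for analysis in ["N.Reg.Nom.Sg.Neut", "PRO.Pers.Subst.3.Nom.Pl.*", "N.Reg.Dat.Sg.Fem"]:
    print(analysis, "->", apply_rules(analysis, GERMAN_RULES))
```

Because the plural rule comes before the gender rule, a plural analysis is reduced to Pl before any gender value is considered.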
I think it should be possible to do something similar that will work for Dutch with UD features.
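As a rough illustration of what such rules might look like for UD, the sketch below reduces UD-style key-value feature strings to a single agreement category. The rule string is a guess written for this example, not a tested Dutch configuration: it assumes plural should win over gender (as in the German rules), that Com, Fem, Masc and Neut are the gender values of interest, and that any remaining key-value string should collapse to an underscore. The helper names are likewise hypothetical.

```python
import re

# Hypothetical rule string for UD features, in the same
# 'pattern/replacement;...' syntax as the German morph_rules example.
UD_RULES = r".*Number=Plur.*/Pl;.*Gender=(Com|Fem|Masc|Neut).*/\1;.*=.*/_"


def parse_rules(rule_string):
    """Split the rule string into (compiled pattern, replacement) pairs."""
    rules = []
    for rule in rule_string.split(";"):
        pattern, replacement = rule.split("/", 1)
        rules.append((re.compile(pattern), replacement))
    return rules


def reduce_feats(feats, rules):
    """Apply the substitutions in order to a UD FEATS value."""
    for pattern, replacement in rules:
        feats = pattern.sub(replacement, feats)
    return feats


RULES = parse_rules(UD_RULES)

for feats in [
    "Gender=Neut|Number=Sing",
    "Case=Nom|Number=Plur|Person=3|PronType=Prs",
    "Gender=Com|Number=Sing",
    "Case=Nom",
]:
    print(feats, "->", reduce_feats(feats, RULES))
```

The values produced here (Pl, Com, Neut, and so on) would have to match whatever agreement classes the rest of the model uses.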
BTW if you have a model for Dutch and are willing to contribute it, I'm happy to credit you and add it to the distribution!
I'm sorry for taking so long with this, but I think the solution you propose here doesn't work out with the way the agreement class is inferred from pronouns. Specifically, it looks like xrenner_marker.py:132 is written with the English model in mind, where the third person is only specified as 'male' or 'female'.
Following the solution above, I looked at what pronoun information I could expect from my corpus and then filled in pronouns.tab accordingly. For instance, a 3rd person pronoun can be 3MascSing, 3FemSing, 3NeutSing, 3Plur or just 3 (in the case of the reflexive), which means the check on the line mentioned above will fail to detect a human third person. I would imagine similar problems occurring for other languages as well, such as Italian, where the plural form also distinguishes masculine and feminine.
I suppose the most robust way of solving this would be to parametrize the way categories like grammatical person and number are specified, just as subject_func, possessive_func etc. are parametrized. A different way might be to note in the documentation that for this (too) the information from UD (or some other standard) should be used where appropriate, so you can rely on people to use Masc/Sing/etc.
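To illustrate the first option, here is a small sketch of what a parametrized check could look like. Everything in it is hypothetical: the function name, the default set and the idea of reading the set from a config setting are assumptions made for this example, and the actual code at xrenner_marker.py:132 is not reproduced here.

```python
# Hypothetical illustration only: not taken from xrenner.
# Default mirrors the hardwired English values mentioned above.
DEFAULT_PERSON_AGREE = frozenset({"male", "female"})


def implies_person(agree, person_agree=DEFAULT_PERSON_AGREE):
    """Return True if the agreement class is one configured to imply a person."""
    return agree in person_agree


# With the hardwired English values, a class like 3MascSing is not recognized.
print(implies_person("3MascSing"))

# With a model-specific set (imagined here as coming from a config setting),
# the same check succeeds.
dutch_person_agree = frozenset({"3MascSing", "3FemSing"})
print(implies_person("3MascSing", dutch_person_agree))
```

Whether gendered third person classes should imply a person at all is a per-language decision, which is the reason for making the set configurable in this sketch.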
Thanks for pointing out that line, it is indeed a throwback to the early days of the system, when more things were hardwired to the English pilot phase. You're right that this line is useless for languages like Dutch.
However, I don't think you should have problems despite this, or at least based on my superficial understanding of Dutch as fairly similar to German. For German too, third person gender does not tell us whether the entity is a person, and the general handling of such correspondences is actually intended to be covered by the general config.ini setting agree_entity_mapping. I would also avoid having a special agreement class for the reflexive, and just list it multiple times with the different genders. This is what it looks like in the German model pronouns.tab:
...
sich Fem
sich Masc
sich Neut
sich Pl
...
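The effect of listing a form several times can be sketched as follows. The snippet assumes a two-column, whitespace-separated layout (form, then agreement class) as in the excerpt above; the loader and its names are written for this example and are not xrenner's own reader.

```python
from collections import defaultdict

# The four entries from the excerpt above, with a tab assumed as separator.
PRONOUN_LINES = "sich\tFem\nsich\tMasc\nsich\tNeut\nsich\tPl\n"


def load_pronouns(text):
    """Map each pronoun form to the set of agreement classes it is listed with."""
    table = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        form, agree = line.split(None, 1)
        table[form].add(agree)
    return table


def can_agree(table, form, agree):
    """True if the form is listed with the given agreement class."""
    return agree in table.get(form, set())


pronouns = load_pronouns(PRONOUN_LINES)
print(sorted(pronouns["sich"]))
print(can_agree(pronouns, "sich", "Pl"))
```

Listing the reflexive once per class means it is compatible with an antecedent of any gender or number, without needing a separate agreement class of its own.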
Since the gender doesn't tell you anything outright about the entity, the fact that the check on line 132 doesn't trigger shouldn't hurt anything (though I agree it should be refactored out!).