Validation rule for Foreign feature
Using validate.py for some French data, I got the following error:
[L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS X in language [br]
for the CoNLL line:
11 maen maen X _ Foreign=Yes 10 appos _ Lang=br|SpaceAfter=No
I think it would be sensible to allow the feature Foreign=Yes on the X tag, whatever the language.
In general I agree (not only for the X tag but perhaps for any tag). But I am hesitant to hard-code it in the validator when checking the boxes in the form is not too much work and it is then neatly visible alongside all other features.
There is one issue, though, that I have not solved yet and that makes the feature Foreign special anyway. The attribute Lang=br in MISC indicates that morphological features in FEATS, if present, are Breton rather than French. However, the feature Foreign should probably be a (hard-coded) exception because:
- The word is foreign in French but not in Breton
- Nevertheless, we still want to be able to quickly filter out words that are foreign in the corpus (whose main language is French)
- We want the feature there regardless of whether Lang=br is or is not in MISC. (Sometimes the code of the source language is not available, some corpora use only Foreign=Yes without Lang=xx, etc.)
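To illustrate the filtering use case from the second point, here is a minimal Python sketch that collects every token marked Foreign=Yes from tab-separated CoNLL-U lines, whether or not Lang=xx is present in MISC. The helper names `parse_feats` and `foreign_words` are hypothetical and are not part of validate.py.

```python
def parse_feats(column):
    """Parse a FEATS or MISC column ("A=1|B=2" or "_") into a dict."""
    if column == "_":
        return {}
    # Items without "=" can occur in MISC; they are skipped here.
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def foreign_words(conllu_lines):
    """Yield (form, source_lang) for every token with Foreign=Yes in FEATS.

    source_lang is the value of Lang in MISC, or None if it is absent.
    """
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue  # not a regular 10-column token line
        if parse_feats(cols[5]).get("Foreign") == "Yes":
            yield cols[1], parse_feats(cols[9]).get("Lang")
```

Applied to the line quoted at the top of this issue (with tabs between columns), this would yield the form `maen` together with the language code `br`; a token with Foreign=Yes and `_` in MISC would be yielded with `None` as its source language.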
I might also suggest taking a look at the guideline suggestions for UGC (Section 4.7) in our recent journal article:
Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations
https://link.springer.com/content/pdf/10.1007/s10579-022-09581-9.pdf
The validator should now judge the Foreign feature according to the main language of the corpus, regardless of Lang=xx in MISC.
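The rule described here could be sketched roughly as follows. This is not the actual validate.py code; the function name `language_for_feature` and the constant `MAIN_LANGUAGE_FEATURES` are hypothetical, and the sketch assumes that Foreign is the only feature exempt from the Lang=xx override.

```python
# Assumption: Foreign is the only feature always judged by the main language.
MAIN_LANGUAGE_FEATURES = {"Foreign"}


def language_for_feature(feature, main_lang, misc):
    """Return the language code whose feature inventory should license `feature`.

    main_lang: language code of the corpus (e.g. "fr").
    misc: dict of MISC attributes for the token (e.g. {"Lang": "br"}).
    """
    if feature in MAIN_LANGUAGE_FEATURES:
        return main_lang  # ignore Lang=xx for these features
    return misc.get("Lang", main_lang)  # otherwise Lang=xx overrides, if present
```

For the token in this issue, `language_for_feature("Foreign", "fr", {"Lang": "br"})` returns `"fr"`, so Foreign=Yes would be checked against the French feature inventory, while any other feature on the same token would still be checked against the Breton one.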