UniversalDependencies/tools

Validation rule for Foreign feature

Closed this issue · 3 comments

bguil commented

Running validate.py on some French data, I got the following error:

[L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS X in language [br]

for the CoNLL line:

11	maen	maen	X	_	Foreign=Yes	10	appos	_	Lang=br|SpaceAfter=No
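To make the columns explicit, here is a minimal sketch that splits this line into the ten CoNLL-U fields and parses FEATS and MISC into dictionaries. The helper names `parse_kv` and `parse_token_line` are made up for illustration and are not taken from validate.py.

```python
# Illustrative only: these helpers are hypothetical, not part of validate.py.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_kv(field):
    """Parse a FEATS- or MISC-style field ('A=b|C=d') into a dict; '_' means empty."""
    if field == "_":
        return {}
    result = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    token = dict(zip(CONLLU_COLUMNS, cols))
    token["FEATS"] = parse_kv(token["FEATS"])
    token["MISC"] = parse_kv(token["MISC"])
    return token


line = "11\tmaen\tmaen\tX\t_\tForeign=Yes\t10\tappos\t_\tLang=br|SpaceAfter=No"
token = parse_token_line(line)
print(token["UPOS"], token["FEATS"], token["MISC"])
```

The point is that Foreign=Yes lives in FEATS while Lang=br lives in MISC, so the two are read from different columns.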

I think it would be sensible to allow the feature Foreign=Yes on the X tag regardless of the language.

In general I agree (not only for the X tag but perhaps for any tag). But I am hesitant to hard-code it in the validator when checking the boxes in the form is not much work, and the feature is then neatly visible alongside all the others.

There is one issue though that I have not solved yet and that makes the feature Foreign special anyway. The attribute Lang=br in MISC indicates that morphological features in FEATS, if present, are Breton rather than French. However, the feature Foreign should probably be a (hard-coded) exception because:

  • The word is foreign in French but not in Breton
  • Nevertheless, we still want to be able to quickly filter out words that are foreign in the corpus (whose main language is French)
  • We want the feature there regardless of whether Lang=br is or is not in MISC. (Sometimes the code of the source language is not available; some corpora use only Foreign=Yes without Lang=xx, etc.)

I might also suggest taking a look at the guideline suggestions for UGC (Section 4.7) in our recent journal article:

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations
https://link.springer.com/content/pdf/10.1007/s10579-022-09581-9.pdf

The validator should now judge the Foreign feature according to the main language of the corpus, regardless of Lang=xx in MISC.
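As a rough sketch of that rule (not the actual validate.py code): every feature is checked against the language given by Lang=xx in MISC when it is present, except Foreign, which is always checked against the main language of the corpus. The function names and the `permitted` table below are hypothetical and only serve the illustration.

```python
# Illustrative only: names and data layout are assumptions, not validate.py internals.

def language_for_feature(feature, corpus_lang, misc):
    """Return the language whose feature list should be consulted for this feature."""
    if feature == "Foreign":
        # Foreign is judged from the point of view of the corpus language.
        return corpus_lang
    # Other features follow Lang=xx from MISC if it is there.
    return misc.get("Lang", corpus_lang)


def check_features(feats, upos, corpus_lang, misc, permitted):
    """Return a list of error messages for features not permitted with this UPOS."""
    errors = []
    for feature, value in feats.items():
        lang = language_for_feature(feature, corpus_lang, misc)
        allowed_upos = permitted.get(lang, {}).get(feature, {}).get(value, set())
        if upos not in allowed_upos:
            errors.append(
                f"Feature {feature} is not permitted with UPOS {upos} in language [{lang}]"
            )
    return errors


# Hypothetical table: language -> feature -> value -> set of permitted UPOS tags.
permitted = {
    "fr": {"Foreign": {"Yes": {"X"}}},
    "br": {},
}

# The token from the report: French corpus, Lang=br in MISC.
errors = check_features({"Foreign": "Yes"}, "X", "fr", {"Lang": "br"}, permitted)
print(errors)
```

With this table, the reported token passes because Foreign=Yes is looked up under fr, while any other feature on the same token would still be looked up under br.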