Feature request: handle compounds in lemma
inariksit opened this issue · 0 comments
An example input:
1 Sapiteed sapi_tee PROPN S Case=Par|Number=Sing 0 root _ _
2 tavalaiusega tavalaius NOUN S Case=Com|Number=Sing 1 nmod _ _
I would like ud2gf to try to parse sapi_tee in the following order:
a. Merge the lemma into sapitee and try to parse it. If it is found in the lexicon, return sapitee_N.
b. If sapitee is not in the lexicon, then try parsing both sapi and tee. If they are both nouns, return CompoundN sapi_N tee_N.
c. If only tee is found in the lexicon, return StrCompoundN "sapi" tee_N.
d. If neither sapi nor tee is in the lexicon, then proceed to morpho_analyse the wordform, i.e. "sapiteed". That's because the lemma may have been wrongly analysed.
e. If ma "sapiteed" didn't return anything either, as a last resort we return StrN <something>. That something can be
- the lemma without the underscore, so StrN "sapitee"
- the wordform as is, so StrN "sapiteed".
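The fallback order above could be sketched roughly as follows. This is only an illustration in Python, not ud2gf code: the function name resolve_compound_noun, the representation of the lexicon as a set of strings, and the morpho_analyse callback are all hypothetical. It also makes some assumptions the proposal leaves open: the wordform is lowercased before analysis, and lemmas where only the first part is in the lexicon, or that have more than two parts, fall through to the morphological analysis step.

```python
from typing import Callable, Optional, Set


def resolve_compound_noun(
    lemma: str,
    wordform: str,
    lexicon: Set[str],
    morpho_analyse: Callable[[str], Optional[str]],
    last_resort: str = "lemma",
) -> str:
    """Return a GF expression (as a string) for a noun lemma that may contain '_'.

    `lexicon` holds function names such as "tee_N"; `morpho_analyse` is a
    stand-in for GF's ma and returns None when it finds nothing.
    """
    merged = lemma.replace("_", "")

    # Merged lemma is in the lexicon: use it directly.
    if f"{merged}_N" in lexicon:
        return f"{merged}_N"

    parts = lemma.split("_")
    if len(parts) == 2:
        modifier, head = parts
        # Both parts are known nouns: build a proper compound.
        if f"{modifier}_N" in lexicon and f"{head}_N" in lexicon:
            return f"CompoundN {modifier}_N {head}_N"
        # Only the head is known: keep the modifier as a string.
        if f"{head}_N" in lexicon:
            return f'StrCompoundN "{modifier}" {head}_N'

    # The lemma may be wrong, so analyse the wordform instead (assumed lowercased).
    form = wordform.lower()
    analysis = morpho_analyse(form)
    if analysis is not None:
        return analysis

    # Last resort: wrap either the merged lemma or the wordform in StrN.
    fallback = merged if last_resort == "lemma" else form
    return f'StrN "{fallback}"'
```

For instance, with a lexicon containing only tee_N and an analyser that finds nothing, resolve_compound_noun("sapi_tee", "Sapiteed", {"tee_N"}, lambda w: None) gives StrCompoundN "sapi" tee_N. The last_resort parameter just switches between the two candidates for the final StrN argument.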
The same applies to compound adjectives, verbs, etc. This assumes that the grammar has the backup functions StrC and StrCompoundC, where C stands for the category. (These may become a command line option, see #24. But for now, while it is not a command line option, we can just introduce those functions in ud2gf and leave it to the grammarian to add them to the grammar.)
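To make the naming scheme concrete, here is a small Python sketch that prints the GF abstract syntax declarations a grammarian might add for a given category. The helper name backup_declarations is hypothetical, and the type signatures are an assumption inferred from the examples above (StrN "sapitee" and StrCompoundN "sapi" tee_N), not something the grammar is known to define.

```python
from typing import List


def backup_declarations(cat: str) -> List[str]:
    """Return assumed GF `fun` declarations for the backup functions of `cat`."""
    return [
        # Str<C> turns a raw string into a word of category C.
        f"fun Str{cat} : String -> {cat} ;",
        # StrCompound<C> prefixes a raw string to a known word of category C.
        f"fun StrCompound{cat} : String -> {cat} -> {cat} ;",
    ]


if __name__ == "__main__":
    for cat in ["N", "A", "V"]:
        for line in backup_declarations(cat):
            print(line)
```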
Interaction with morpho_analyse
As of April 2022, ud2gf first tries to parse the lemma, and only secondarily does ma on the word form. If the default behaviour changes, this proposed algorithm should be reconsidered too.