Lemmatizing particles に、で
ruukasu3 opened this issue · 3 comments
During data pre-processing I've found instances of the particles に and で lemmatizing into the hiragana だ (I've included the two examples below). It is not clear to me why this is happening. I would expect ように to become "様" and "に", and までで to become "まで" and "で". I'm concerned that this will cause issues when I process the text through my model.
UniDic specifies that the lemma for "に", "で", and "な" is "だ". I'm pretty sure the reasoning here is that na-adjectives and nouns get conjugated with "だ" in the terminal form, but with "な" in the attributive, "で" in the continuative, and "に" in the adverbial (see https://en.wiktionary.org/wiki/%E6%A7%98#Inflection). You can see similar behavior with the lemmas for 出来る and 成る, where the lemma is the base terminal form rather than the surface conjugation でき/出来 or なり/成り.
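If the merged lemma turns out to be a problem for your model, one possible workaround is to fall back to the surface form whenever a token's lemma is "だ" and its surface is one of に/で/な. Below is a minimal sketch of that idea. It assumes your tokens are available as (surface, lemma) pairs; the helper name `keep_copula_surface` is hypothetical and not part of any tokenizer library, and the sample pairs are written by hand to mirror the two cases in this issue rather than taken from real tagger output.

```python
# Surface forms that, per the explanation above, UniDic lemmatizes to だ.
COPULA_SURFACES = {"に", "で", "な"}


def keep_copula_surface(tokens):
    """Return one lemma per token, keeping に/で/な as-is when their lemma is だ.

    `tokens` is a list of (surface, lemma) pairs. That shape is an assumption;
    adapt it to whatever your tokenizer actually returns.
    """
    out = []
    for surface, lemma in tokens:
        if lemma == "だ" and surface in COPULA_SURFACES:
            # Keep the surface form instead of collapsing it into だ.
            out.append(surface)
        else:
            out.append(lemma)
    return out


# Hand-written pairs mirroring ように and までで (not real tagger output).
youni = [("よう", "様"), ("に", "だ")]
madede = [("まで", "まで"), ("で", "だ")]

print(keep_copula_surface(youni))
print(keep_copula_surface(madede))
```

Note that this also keeps the surface form for genuine copula uses of で and な, so you lose the normalization the dictionary was giving you for those. Whether that trade-off is worth it depends on your downstream task.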
Okay, that makes sense, thank you!
In the future please post code or text as text, not as images - images of code make it impossible to copy/paste or search and much harder to read.
@mmcauliffe's answer is right, and this is basically conventional analysis of lemmas, though に in particular does come up less often than the other ones.