polm/fugashi

Lemmatizing particles に、で

ruukasu3 opened this issue · 3 comments

During data pre-processing I've found instances of the particles に and で being lemmatized to the hiragana だ (I've included the two examples below). It is not clear to me why this is happening. I would expect ように to become "様" and "に", and までで to become "まで" and "で". I'm concerned that this will cause issues when I run the text through my model.

image
image

mmcauliffe commented

UniDic specifies that the lemma for "に", "で", and "な" is "だ". I'm pretty sure the reasoning is that na-adjectives and nouns take "だ" in the terminal form, but "な" in the attributive, "で" in the continuative, and "に" in the adverbial (see https://en.wiktionary.org/wiki/%E6%A7%98#Inflection). You can see similar behavior with the lemmas for 出来る and 成る, where the lemma is the base terminal form rather than the surface conjugation でき/出来 or なり/成り.
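
If the だ lemma is a problem downstream, one possible workaround is to fall back to the surface form for just these three conjugated forms and keep the lemma everywhere else. This is only a minimal sketch: `normalize` and `COPULA_SURFACES` are hypothetical names, and the input pairs are hand-written to mirror the case in this issue rather than copied from real tagger output.

```python
from typing import Iterable, List, Tuple

# Surface forms that, per this thread, UniDic lemmatizes to だ.
COPULA_SURFACES = {"に", "で", "な"}


def normalize(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the lemma of each (surface, lemma) pair, except that
    に/で/な with lemma だ are kept in their surface form."""
    out = []
    for surface, lemma in pairs:
        if lemma == "だ" and surface in COPULA_SURFACES:
            out.append(surface)  # keep the form that appeared in the text
        else:
            out.append(lemma)
    return out


# Hand-written (surface, lemma) pairs, not real tagger output.
example = [("よう", "様"), ("に", "だ")]
print(normalize(example))
```

With fugashi and a UniDic dictionary, the pairs could likely be built with something like `[(w.surface, w.feature.lemma) for w in tagger(text)]`. Whether keeping the surface form is the right call depends on what the model is supposed to learn from these tokens.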

Okay, that makes sense, thank you!

polm commented

In the future please post code or text as text, not as images - images of code make it impossible to copy/paste or search and much harder to read.

@mmcauliffe's answer is right, and this is basically conventional analysis of lemmas, though に in particular does come up less often than the other ones.