Lemmatizing particles に、で

Question

ruukasu3 opened this issue 2 years ago · 3 comments

I've run into a situation in my data pre-processing where the particles に and で are lemmatized to the hiragana だ (I've included the two examples below). It is not clear to me why this is happening. I would expect ように to become "様" and "に", and までで to become "まで" and "で". I'm concerned that this will cause issues when I process the text through my model.

Answer 1 · 2023-04-10T21:01:48.000Z

UniDic specifies that the lemma for "に", "で", and "な" is "だ". I'm pretty sure the reasoning is that na-adjectives and nouns take "だ" in the terminal form, "な" in the attributive, "で" in the continuative, and "に" in the adverbial (see https://en.wiktionary.org/wiki/%E6%A7%98#Inflection). You can see similar behavior with the lemmas for 出来る and 成る, where the lemma is the base terminal form rather than the surface conjugation でき/出来 or なり/成り.
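
If keeping the surface form of these tokens matters for a downstream model, one option is a small post-processing step. Below is a minimal sketch, assuming the tokens are already available as (surface, lemma) pairs. The token list is hand-written for illustration and is not real tokenizer output, and `normalize_lemma` is a hypothetical helper, not part of any tokenizer's API.

```python
# Surface forms that are analyzed as inflections of the copula だ.
COPULA_SURFACES = {"に", "で", "な"}


def normalize_lemma(surface: str, lemma: str) -> str:
    """Keep the surface form when an inflected copula form was lemmatized to だ."""
    if lemma == "だ" and surface in COPULA_SURFACES:
        return surface
    return lemma


# Hand-written (surface, lemma) pairs standing in for tokenizer output.
tokens = [("よう", "様"), ("に", "だ"), ("まで", "まで"), ("で", "だ"), ("だ", "だ")]

normalized = [normalize_lemma(surface, lemma) for surface, lemma in tokens]
print(normalized)
```

Whether to override the lemma like this depends on the task: collapsing に, で, and な into だ is consistent with the dictionary's analysis, so the override only makes sense if the model needs to tell these forms apart.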

Answer 2 · 2023-04-12T03:57:20.000Z

Okay, that makes sense, thank you!

Answer 3 · 2023-04-13T04:14:13.000Z

In the future please post code or text as text, not as images - images of code make it impossible to copy/paste or search and much harder to read.

@mmcauliffe's answer is right, and this is basically conventional analysis of lemmas, though に in particular does come up less often than the other ones.