jbesomi/texthero

Add Lemmatization

henrifroese opened this issue · 15 comments

Lemmatization can be thought of as a more advanced form of the stemming we already have in the preprocessing module. You can read about it e.g. here. The implementation should be done with spaCy.

ToDo

Implement a function hero.lemmatize(s: TokenSeries) (or maybe rather TextSeries?). Using spaCy, this should be fairly straightforward. It should go into the NLP module and probably look very similar to the other spaCy-based functions there.

Just comment below if you want to work on this and/or have any questions. I think this is a good first issue for new contributors.

Hi, I would like to work on this issue.

That's great!

Regarding the function signature, I'm not sure which one is better (maybe accepting both Series types might even make sense?). It will depend on the implementation requirements. With spaCy we will probably have to first tokenize the text and then take the lemma with token.lemma_ or something similar ... so maybe we need to pass a TextSeries? (as we will need to pass the sentence into nlp(sentence))

Also, independently of that choice, both stem (should we rename it to stemming?) and lemmatize should probably have the same signature.

Regards,

Using the token.lemma_ attribute, a straightforward implementation would be something like this:

    import pandas as pd
    import spacy

    def lemmatize(s: pd.Series) -> pd.Series:
        # NER is not needed for lemmatization, so it is disabled for speed.
        nlp = spacy.load("en_core_web_sm", disable=["ner"])

        lemmatized_docs = []
        for doc in nlp.pipe(s.astype("unicode").values, batch_size=32):
            lemmatized_docs.append(" ".join([word.lemma_ for word in doc]))

        return pd.Series(lemmatized_docs, index=s.index)

which is very similar to the other functions in the nlp.py module. Also, nlp() requires a string as input rather than tokens, so I think it makes sense to give it a TextSeries signature. I set batch_size to 32 here because that's what all the other functions use, but I don't know if or why that's a good choice.

What I found to be a problem is that token.lemma_ seems to replace any pronouns with -PRON-. For example:

>>> s = pd.Series(["My name is Adrian"])
>>> hero.lemmatize(s)
0    -PRON- name be adrian
dtype: object

We don't really want that, right? What do you guys think?

I agree we probably do not want that 🙅. Have you found an alternative solution? 🥉

Maybe just replacing every -PRON- with word.text?

The first question is: is that expected, or a mistake from spaCy? "My" in this case is a possessive pronoun, and therefore "-PRON-" somewhat makes sense.

I asked about this on the spaCy gitter but no answers for now.

Update: from the spaCy Annotation Specifications:

About spaCy's custom pronoun lemma for English
spaCy adds a special case for English pronouns: all English pronouns are lemmatized to the special token -PRON-. Unlike verbs and common nouns, there's no clear base form of a personal pronoun. Should the lemma of "me" be "I", or should we normalize person as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.

So I guess that's correct behavior. We will just have to make it clear in the function docstring.

Yes, I would suggest doing it just like @AlfredWGA said: the pronouns simply don't get lemmatized, and we'll make that clear in the docstring.

If the folks at spaCy decided to replace pronouns with -PRON-, there must be a reason. I guess that's because when we lemmatize a text, we want to normalize it, and with the -PRON- token we are achieving exactly that.

What I suggest is to keep their approach as the default, but to give the user the option to replace -PRON- with the respective token through a keep_pron (or keep_pronouns) parameter or something similar.
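A minimal sketch of how that could look, building on the implementation above (keep_pronouns is just the parameter name proposed here, and this assumes the spaCy 2.x behaviour where pronoun lemmas come back as -PRON-):

    import pandas as pd
    import spacy

    def lemmatize(s: pd.Series, keep_pronouns: bool = False) -> pd.Series:
        # NER is not needed for lemmatization, so it is disabled for speed.
        nlp = spacy.load("en_core_web_sm", disable=["ner"])

        lemmatized_docs = []
        for doc in nlp.pipe(s.astype("unicode").values, batch_size=32):
            # With keep_pronouns=True, keep the original token text for pronouns
            # instead of spaCy's -PRON- placeholder lemma.
            lemmas = [
                token.text if keep_pronouns and token.lemma_ == "-PRON-" else token.lemma_
                for token in doc
            ]
            lemmatized_docs.append(" ".join(lemmas))

        return pd.Series(lemmatized_docs, index=s.index)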

Great idea @jbesomi !

There are also some inconsistencies concerning punctuation and apostrophes. When spaCy tokenizes the text, it separates punctuation from words. So when joining the tokens back together, there will be extra spaces where before there were none. Some ideas for how to deal with this are:

  1. Removing punctuation before lemmatizing with hero.remove_punctuation:
    This should not be done, however, because spaCy won't recognize certain expressions anymore after punctuation is removed. For example, the expression "I'm" will become "I m", and spaCy then won't recognize the "m" as the verb "be", whereas otherwise it would.

  2. Outputting the spaCy tokens as a TokenSeries:
    However, spaCy sometimes tokenizes differently than hero.tokenize would. For example, texthero will not separate "world's", but spaCy would tokenize it into [world, 's]. So outputting the spaCy tokens as a TokenSeries would be inconsistent.

Any other ideas? Or is it ok to have some more spaces in the lemmatized text? We can always remove punctuation afterwards.

To clarify my comment above, here you can see how the function creates extra spaces:

>>> s = pd.Series(["I'm here.", "The world's biggest pumpkin"])
>>> lemmatize(s)
0            -PRON- be here .
1    the world 's big pumpkin
dtype: object

Notice the shift of the period and the apostrophe.

When removing punctuation first, this happens:

>>> s = pd.Series(["I'm here.", "The world's biggest pumpkin"]).pipe(remove_punctuation)
>>> lemmatize(s)
0              -PRON- m here
1    the world s big pumpkin
dtype: object

As you can see, the "m" could not be lemmatized to "be". The lemma_ attribute from spaCy seems to be incompatible with hero.remove_punctuation if punctuation is removed before lemmatizing.

Another idea is that we don't need to care about extra spaces if we just output the tokens that spaCy creates as a TokenSeries. That would look like this:

>>> lemmatize_with_tokenization(s)
0             [-PRON-, be, here, .]
1    [the, world, 's, big, pumpkin]
dtype: object

>>> hero.tokenize(s)
0                      [I'm, here, .]
1    [The, world's, biggest, pumpkin]
dtype: object

Here you can see, however, that spaCy tokenizes differently than hero.tokenize, because "world's" is getting split up. So outputting the result as a TokenSeries would be very inconsistent and would lead to problems.
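For completeness, the tokenized variant used for the example above could be implemented roughly like this (a sketch; the function name is only illustrative):

    import pandas as pd
    import spacy

    def lemmatize_with_tokenization(s: pd.Series) -> pd.Series:
        # Same loop as in the string-based sketch above, but the lemmas are kept
        # as a list of tokens instead of being joined back into one string.
        nlp = spacy.load("en_core_web_sm", disable=["ner"])
        lemmatized_docs = []
        for doc in nlp.pipe(s.astype("unicode").values, batch_size=32):
            lemmatized_docs.append([token.lemma_ for token in doc])
        return pd.Series(lemmatized_docs, index=s.index)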

If we want the input to be already tokenized, then tokens like [world's] would be split into two tokens anyway, so I don't think that would solve the problem directly either.

I think it might be best to accept the extra spaces, because the text is likely going to get preprocessed further afterwards anyway. If we then apply hero.remove_punctuation and hero.remove_whitespace after lemmatizing, it will have the same format again as any other text with this preprocessing.
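For example, the post-lemmatization cleanup could look roughly like this (a sketch; it assumes the new function has been merged as hero.lemmatize, while remove_punctuation and remove_whitespace already exist in texthero's preprocessing module):

    import pandas as pd
    import texthero as hero

    s = pd.Series(["I'm here.", "The world's biggest pumpkin"])

    # Lemmatize first, then strip punctuation and collapse the extra spaces
    # that joining the spaCy tokens left behind.
    s_clean = (
        s.pipe(hero.lemmatize)
         .pipe(hero.remove_punctuation)
         .pipe(hero.remove_whitespace)
    )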

I agree it looks like the best option is allowing the spaces.

Do you want to prepare a Pull Request for that?

Yes, I'll prepare the Pull Request

I think this issue can be closed now.

Thanks!