TimSchopf/KeyphraseVectorizers

Spacy tagger is not available in French

hboisgibault opened this issue · 11 comments

Spacy doesn't provide a tagger for French.
It would be great to have an option to provide a custom tagger.
Since spacy is used, it could be added as a custom pipeline component.

Spacy taggers may not be enough and some users may want to use custom taggers, even for supported languages.

Thanks for the feedback.

Yes, the current solution is limited to spaCy pipelines that already provide a built-in tagger. I agree that an option to define custom taggers would be useful. I could imagine implementing this feature with an additional argument that receives a tagger callable.

What do you think about that?

Hi Tim,
That's a good idea. I started implementing this last week and added a "pos_tagger" argument to provide a custom tagger.
I was able to create a custom spaCy tagger and pass it to the pipeline. I can open a PR for this.

However, this approach is limited to spaCy, and since I want to use a tagger from the flair library, it is a bit difficult to use.
What the library expects for phrase selection is a list of (TOKEN, TAG) tuples. I think it would be good to have a more general way to provide this list, but I don't have a good solution for it, since the spaCy pipeline is used internally.
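
For readers following along, the (TOKEN, TAG) contract described above can be illustrated without spaCy or flair. The following is a minimal sketch, not the library's actual implementation: `extract_phrases` is a hypothetical helper, the sample tags are hand-written, and it only covers the simple case of joining runs of consecutive tokens whose tag matches a single regular expression (such as the `N.*` inside a `<N.*>+` pattern).

```python
import re
from typing import List, Tuple


def extract_phrases(tagged: List[Tuple[str, str]], tag_regex: str = r"N.*") -> List[str]:
    """Join maximal runs of consecutive tokens whose tag fully matches tag_regex."""
    pattern = re.compile(tag_regex)
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        if pattern.fullmatch(tag):
            # Tag matches: extend the phrase currently being built.
            current.append(token)
        else:
            # Tag does not match: close the current phrase, if any.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written (token, tag) pairs; any tagger that can produce this list would do.
tagged = [
    ("le", "DET"),
    ("traitement", "NOUN"),
    ("automatique", "ADJ"),
    ("des", "ADP"),
    ("langues", "NOUN"),
]
print(extract_phrases(tagged))  # ['traitement', 'langues']
```

Because the selection step only needs the list of tuples, the tagger that produces it could in principle come from any library.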

It seems that there is a pipeline, "fr_dep_news_trf", that uses transformers and has a tagger for French. I will test it with the library and report back.

To follow up on this: it is possible to use the "fr_dep_news_trf" pipeline to handle French text.

However, a few steps are needed to make it work:

  • install the spacy-transformers library
  • in the spaCy pipeline, three components need to be added: parser, morphologizer and transformer.
    Since these components are not added to the pipeline by default, there could be an option to add them, or an option to pass a custom list of spaCy components.

I started a branch to add the components by default: https://github.com/Logora/KeyphraseVectorizers/tree/use_lemmatizer

hi @hboisgibault ,

Is it possible to have more information? I added "fr_dep_news_trf", but how do I pass "parser, morphologizer and transformer" to the pipeline?

Hi @devnumber10, you need to fork the repo or use my fork. You can see the line where I added the components here:
Link

Hi @hboisgibault, thanks for your help. I am now using your fork. I see that you have a "pos_tagger" parameter. What should I pass to it? I want groups of keywords made of one or more nouns.

vectorizer = KeyphraseCountVectorizer(spacy_pipeline='fr_dep_news_trf', pos_pattern='<N.*>+', stop_words='french', pos_tagger=)

"pos_tagger" is only for passing a custom tagger component. It is not needed with the pipeline you are using, so you can omit this parameter.

As of release v0.0.9, the fr_dep_news_trf pipeline can be used in KeyphraseVectorizers without needing @hboisgibault's fork. I removed the default exclusion of certain spaCy pipeline components. This slightly slows down keyphrase extraction. However, it improves compatibility with all available spaCy pipelines, including this one.

With the v0.0.10 release I added the option to use a custom POS-tagger. You can check out how it works here.
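
A custom tagger in this sense is just a callable that turns raw documents into (token, tag) tuples. The sketch below is a toy, dictionary-based tagger meant only to show the shape of such a callable: the lexicon, the function name `lexicon_pos_tagger`, and the whitespace tokenization are made up for illustration. The parameter name `raw_documents` and the `custom_pos_tagger` argument shown in the final comment are assumptions based on the project README for that release, so verify them before relying on this.

```python
from typing import List, Tuple

# Toy lexicon (hypothetical): a real tagger, e.g. one from flair, would replace it.
LEXICON = {"le": "DET", "chat": "NOUN", "noir": "ADJ", "dort": "VERB"}


def lexicon_pos_tagger(raw_documents: List[str]) -> List[Tuple[str, str]]:
    """Return one flat list of (token, tag) tuples covering all documents."""
    tagged: List[Tuple[str, str]] = []
    for doc in raw_documents:
        # Naive whitespace tokenization; unknown words get the tag "X".
        for token in doc.split():
            tagged.append((token, LEXICON.get(token.lower(), "X")))
    return tagged


print(lexicon_pos_tagger(["Le chat noir dort"]))

# Assumed usage (not executed here; argument name is an assumption):
# vectorizer = KeyphraseCountVectorizer(custom_pos_tagger=lexicon_pos_tagger)
```

The tag names returned by the callable need to be the ones that the chosen POS pattern refers to, otherwise no phrases will match.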

Thanks, Tim, for the update!