jbesomi/texthero

TokenSeries as input to every representation function

jbesomi opened this issue · 1 comment

One of Texthero's guiding principles is to give the NLP developer more control.

Motivation

A simple example is the TfidfVectorizer object from scikit-learn. It's fast and great, but it has many parameters, and before applying TF-IDF it actually preprocesses the text data. I just discovered that TfidfVectorizer even L2-normalizes the output by default, and the normalization can only be avoided by explicitly passing norm=None.
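To make this concrete, here is a minimal sketch against scikit-learn's public API (the corpus is made up):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = ["Texthero is fun", "Texthero is simple"]

# By default, TfidfVectorizer lowercases, tokenizes, and L2-normalizes.
X = TfidfVectorizer().fit_transform(corpus)
print(np.linalg.norm(X[0].toarray()))  # ~1.0: every row has unit L2 norm

# The normalization has to be switched off explicitly:
X_raw = TfidfVectorizer(norm=None).fit_transform(corpus)
print(np.linalg.norm(X_raw[0].toarray()))  # > 1: raw, unnormalized TF-IDF weights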

With Texthero's tfidf we just want the code to apply TF-IDF. That's it. No stopword removal, no tokenization, no normalization. All these essential steps can be done by the NLP developer in the pipeline (the drawback is that it might be less efficient, but with the advantage of clear and predictable behavior).

Solution

All representation functions will require the Pandas Series to be already tokenized. In the beginning, we can still accept a text Pandas Series; in this case, the default hero.tokenize function will be applied, but a warning message will be emitted (see example below).
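To illustrate the intended usage (a sketch; hero.tokenize and hero.tfidf are existing Texthero functions):

import pandas as pd
import texthero as hero

s = pd.Series(["Today is such a beautiful day", "It is raining"])

# Recommended: tokenize explicitly, then compute the representation.
s.pipe(hero.tokenize).pipe(hero.tfidf)

# Accepted in the beginning: a text Series is tokenized with the default
# hero.tokenize, but a warning is emitted.
s.pipe(hero.tfidf)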

Interested in working on this task?
For the tfidf and term_frequency functions, the code is (almost) already written. The body of the function would look like this:

# Transition behavior (see Solution above): accept a raw text Series,
# but warn and apply the default hero.tokenize.
if not isinstance(s.iloc[0], list):
    warnings.warn(
        "🤔 It seems like the given Pandas Series is not tokenized. "
        "Applying `hero.tokenize(s)` first. Tokenize the Series yourself "
        "to silence this warning."
    )
    s = preprocessing.tokenize(s)

tfidf = TfidfVectorizer(
    use_idf=True,
    max_features=max_features,
    min_df=min_df,
    max_df=max_df,
    norm=None,  # no hidden L2 normalization (see Motivation)
    tokenizer=lambda x: x,  # identity: the input is already a list of tokens
    preprocessor=lambda x: x,  # identity: no lowercasing or other cleanup
)
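Put together, the whole function might look like the following sketch (the exact signature, defaults, and Series-of-lists return format are illustrative assumptions, not settled in this issue):

import warnings

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

from texthero import preprocessing


def tfidf(s, max_features=None, min_df=1, max_df=1.0):
    # Transition behavior: accept a raw text Series, warn, and tokenize it.
    if not isinstance(s.iloc[0], list):
        warnings.warn(
            "🤔 It seems like the given Pandas Series is not tokenized. "
            "Applying `hero.tokenize(s)` first."
        )
        s = preprocessing.tokenize(s)

    tfidf = TfidfVectorizer(
        use_idf=True,
        max_features=max_features,
        min_df=min_df,
        max_df=max_df,
        norm=None,  # no hidden L2 normalization
        tokenizer=lambda x: x,  # identity: tokens pass through unchanged
        preprocessor=lambda x: x,  # identity: no lowercasing or stripping
    )
    # One TF-IDF vector (as a plain list) per document, index preserved.
    return pd.Series(tfidf.fit_transform(s).toarray().tolist(), index=s.index)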

If you are interested in helping out, just leave a comment!

I'm working on this with @mk2510