jbesomi/texthero

Preprocessing: explain how to create a custom pipeline

jbesomi opened this issue · 5 comments

Add a section under Getting Started - Preprocessing that explains how to create a custom pipeline. This solution is easier than #9.

Explain in the docstring of clean how to create a custom pipeline. Code example:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

Hey @jbesomi, should we include all the methods from the default clean pipeline in the example?

Hey Cedric, what do you mean by that? The idea here is to explain how to generate a custom pipeline ...

@jbesomi, sorry it wasn't clear. I'm talking about how to explain generating a custom pipeline in clean's docstring.

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

^^The example you gave shows how to customize the pipeline by passing a custom set of stopwords to remove_stopwords. I'm wondering whether you want to add more examples to the docstring that show other ways to customize the pipeline.
For example, showing how to use only some of the methods from the default pipeline:

import texthero as hero
from texthero import preprocessing

# df is assumed to be an existing DataFrame with a 'text' column.
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

Actually, I'm not sure it's a good idea to show more than one example in the docstring. Maybe a more detailed explanation should go in a separate Getting Started - Preprocessing section, as you suggested.

The second snippet you showed is basically the one from get_default_pipeline. The discussion here was more intended to show how to create a custom pipeline with functions that might require other arguments as input ...
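
To illustrate the point about functions that need extra arguments, here is a minimal sketch of the general pattern. It assumes that clean calls each pipeline entry with a single argument, as the lambda example suggests. The functions lowercase, remove_stopwords and clean below are hypothetical stand-ins that work on plain lists of strings, not texthero's own implementations, so the sketch runs without texthero or pandas installed. It uses functools.partial as an alternative to a lambda for binding the extra argument.

```python
from functools import partial

# Hypothetical stand-ins for preprocessing steps (not texthero's code).
# They operate on a list of strings instead of a pandas Series.
def lowercase(texts):
    return [t.lower() for t in texts]

def remove_stopwords(texts, stopwords):
    # Drop every whitespace-separated token that appears in `stopwords`.
    return [" ".join(w for w in t.split() if w not in stopwords) for t in texts]

def clean(texts, pipeline):
    # Apply each single-argument step in order, feeding the output forward.
    for step in pipeline:
        texts = step(texts)
    return texts

custom_set_of_stopwords = {"is"}

pipeline = [
    lowercase,
    # partial binds the extra argument, so this step takes only `texts`.
    partial(remove_stopwords, stopwords=custom_set_of_stopwords),
]

result = clean(["Is is a stopword"], pipeline)
print(result)
```

With texthero itself, the same idea would presumably be written as functools.partial(hero.remove_stopwords, stopwords=custom_set_of_stopwords) in place of the lambda. Both forms turn a function that needs extra arguments into one that takes only the Series.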

I see, that is much clearer now. Can you give me some pointers on what should be added to the docstring? Is adding the code below in Examples enough?

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)