Kind of Pandas Series
jbesomi opened this issue · 2 comments
Motivation
Having a unified view and a clear idea of the expected Pandas Series input is useful for both users and developers.
Precise and correct errors are very valuable to users, as they make debugging easy and pleasant. We can summarize three kinds of Pandas Series that a Texthero function can receive as input (or return as output):
Types
- "Pandas Text Series" --> every cell has some text
- "Pandas Tokenized Series" --> every cell has a list of tokens
- "Pandas Representation Series" --> every cell is a representation of a text (a list of float values). This will be improved soon (see issue #43)
In the best scenario, every Texthero function receives as input a Pandas Series of one of these three kinds. Testing that the given Pandas Series is of the expected type is therefore useful.
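A minimal sketch of what such tests could look like with plain pandas. The name `is_text_series` is borrowed from later in this thread; `is_token_series`, `is_representation_series` and the helper `_all_cells` are hypothetical names used only for illustration, not existing Texthero functions.

```python
import pandas as pd


def _all_cells(s, predicate):
    """Return True if s is a pd.Series and predicate holds for every cell.

    Note: an empty Series passes every check, since there is no cell to fail.
    """
    return isinstance(s, pd.Series) and all(predicate(cell) for cell in s)


def is_text_series(s):
    """Every cell is a string."""
    return _all_cells(s, lambda cell: isinstance(cell, str))


def is_token_series(s):
    """Every cell is a list of strings."""
    return _all_cells(
        s,
        lambda cell: isinstance(cell, list)
        and all(isinstance(tok, str) for tok in cell),
    )


def is_representation_series(s):
    """Every cell is a list of floats."""
    return _all_cells(
        s,
        lambda cell: isinstance(cell, list)
        and all(isinstance(x, float) for x in cell),
    )
```

Each check walks over every cell, so it costs O(n) in the length of the Series.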
Go further
- preprocessing.py: almost all functions (with the exception of tokenize) take as input a Pandas Text Series and return a Pandas Text Series.
- representation.py: the input will be (see #44) a Tokenized Pandas Series and the output will be a Representation Pandas Series.
- nlp.py: the input is a Text Pandas Series, whereas the output is TODO.
- visualization.py: TODO.
It would be great to have a unified and clear view of all this:
- Every function should check for the right type (we will need to define the "check" function, probably in a new file, something like _helper.py)
- Once everything is in place and defined, add to the website (documentation) a clear document that explains all this. It will be so easy to use Texthero then!
- New ideas
Extra
Unfortunately, there are more variants of Pandas Series (output of named_entities, output of pca, ...); there is still some design work to do there ...
Work in progress ...
I just opened a first draft PR in #69 as a step toward implementing this. I'll copy the docstring of the new file _helper.py here:
Hero Series Types
There are different kinds of Pandas Series used in the library, depending on use.
For example, the functions in preprocessing.py usually take as input a Series
where every cell is a string, and return as output a Series where every cell
is a string. To make handling the different types easier (and most importantly
intuitive for users), this file implements the types as subclasses of Pandas
Series and defines functions to check the types.
These are the implemented types:
- TextSeries: cells are text (i.e. strings), e.g. "Test"
- TokenSeries: cells are lists of tokens (i.e. lists of strings), e.g. ["word1", "word2"]
- RepresentationSeries: cells are vector representations of text (see issue #43), e.g. [0.25, 0.75]
You could now do this:
@OutputSeries(RepresentationSeries)
@InputSeries(TokenSeries)
def tfidf(s: TokenSeries) -> RepresentationSeries:
    ...
The decorators (@...) make Python check whether the input is valid
and transform the output into the correct type,
which leads to simpler code and exception handling (no need to write
"if not is_text_series(s): raise ..." in every function) and easy
modification/expansion later on. The function will automatically raise the correct error
if the input Pandas Series does not have a list of words in every cell (as it expects a TokenSeries).
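To illustrate the mechanism, here is a minimal sketch of how such decorators could be written. It is not the code from the PR: the `check` static method, the error message, and the example function `lowercase_tokens` are assumptions made for this sketch, and only `TokenSeries`, `InputSeries` and `OutputSeries` are names taken from the docstring.

```python
import functools

import pandas as pd


class TokenSeries(pd.Series):
    """Stand-in for the TokenSeries type: every cell is a list of strings."""

    @staticmethod
    def check(s):
        # Hypothetical property check: s only has to look like a TokenSeries.
        return isinstance(s, pd.Series) and all(
            isinstance(cell, list) and all(isinstance(tok, str) for tok in cell)
            for cell in s
        )


def InputSeries(expected_type):
    """Reject the first argument unless it has the properties of expected_type."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(s, *args, **kwargs):
            if not expected_type.check(s):
                raise TypeError(
                    f"{func.__name__} expects a {expected_type.__name__}: "
                    f"{expected_type.__doc__}"
                )
            return func(s, *args, **kwargs)

        return wrapper

    return decorator


def OutputSeries(output_type):
    """Wrap the function's return value in output_type."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return output_type(func(*args, **kwargs))

        return wrapper

    return decorator


@OutputSeries(TokenSeries)
@InputSeries(TokenSeries)
def lowercase_tokens(s):
    # Hypothetical example function; the body contains no type checking.
    return s.map(lambda tokens: [tok.lower() for tok in tokens])
```

Calling `lowercase_tokens(pd.Series([["Hello", "World"]]))` returns a `TokenSeries`, while `lowercase_tokens(pd.Series(["not tokens"]))` raises a `TypeError` before the function body runs.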
Users do not have to use
the custom types like TokenSeries themselves! They can just use
a normal Pandas Series, and they can immediately see from
the function header that their input should look / behave like
a TokenSeries, and that their output will be a RepresentationSeries.
The typing helps users understand the code more easily,
as they can see immediately from the documentation
which types of Series a function operates on. This is much more
expressive and clearer than e.g. "tfidf(s: pd.Series) -> pd.Series".
Note that users can of course still simply
use ordinary pd.Series objects.
The functions will then just check if the Series could be
e.g. a TextSeries (so it checks the properties) to give maximum flexibility.
The custom types are subclasses of pd.Series anyway. Thus,
the types enable better documentation and expressiveness
of the code and do not mean that a user really has to pass
e.g. a TextSeries; what they pass just has to have the properties
of one.
Example: user has standard pd.Series s and wants to clean the text.
Calling hero.clean(s), the clean function will check whether s
could be a TextSeries. If yes, it proceeds with the cleaning
and returns a TextSeries. If no, an error is raised with
a good explanation.
Concerning performance, a user might often have a Series s on which
different operations will be performed. The behaviour will be as follows:
s = pd.Series("test")
s = hero.remove_punctuation(s)
# hero.remove_punctuation first checked if s can be a TextSeries.
# That is the case, so the function was applied as usual.
# The output was then transformed to a TextSeries, without
# the user noticing. If now something like this is done:
s = hero.remove_diacritics(s)
# the remove_diacritics function will immediately notice
# that s is a TextSeries, so the check is O(1) through isinstance.
(NOTE: this could lead to problems later on, if e.g. a user
changes s after remove_punctuation, then the library still
treats it as a TextSeries even though the user might have
applied functions from e.g. a different library such that s no longer
fulfills the "TextSeries" properties. The error messages
would then not be as good.)
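The trade-off in this note can be made concrete with a small sketch. The functions `full_check` and `fast_check` are hypothetical names for this illustration, and the `TextSeries` class is an empty stand-in, not the implementation from the PR.

```python
import pandas as pd


class TextSeries(pd.Series):
    """Stand-in for the TextSeries type: every cell is a string."""


def full_check(s):
    """O(n) check: look at every cell."""
    return isinstance(s, pd.Series) and all(isinstance(cell, str) for cell in s)


def fast_check(s):
    """O(1) for an object already typed as TextSeries, O(n) otherwise."""
    return isinstance(s, TextSeries) or full_check(s)


# dtype=object so that the cell can later hold a non-string value.
t = TextSeries(["a", "b"], dtype=object)
t.iloc[0] = 1.0  # in-place change, e.g. made by code outside the library

trusted = fast_check(t)  # the type is trusted, cells are not inspected
actual = full_check(t)   # every cell is inspected
```

Here `trusted` is `True` while `actual` is `False`: the isinstance shortcut saves the O(n) scan but accepts a Series that no longer has the TextSeries properties.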
The classes are lightweight subclasses of pd.Series and serve 2 purposes:
- Good documentation for users through docstring.
- Function(s) to check if a pd.Series has the required properties.
More Examples
import pandas as pd
from texthero._helper import *  # Bad style

@OutputSeries(TextSeries)
@InputSeries(TextSeries)
def do_nothing(s: TextSeries) -> TextSeries:
    return s

t = do_nothing(pd.Series("test"))
t
# 0    test
# dtype: object
type(t)
# TextSeries

do_nothing(pd.Series([1.0]))  # not a TextSeries
# TypeError(...)  (error message is good; too long so left out here)
do_nothing("test")  # not a TextSeries
# TypeError(...)
These are just some simple examples. As you can see, this makes it easy for the vast majority of functions to implement checking the correct type through the decorators. It also makes it easier for users to use the library as they immediately know what kind of Series they will give as input / receive as output.