All preprocessing functions to receive as input TokenSeries
jbesomi opened this issue · 4 comments
The aim of this issue is to discuss and understand when tokenize should happen in the pipeline.

The current solution is to apply tokenize once the text has already been cleaned, either with clean or with a custom pipeline. In general, the cleaning phase also removes punctuation symbols.
The problem with this approach is that, especially for non-Western languages (#18 and #128), the tokenization operation might actually need the punctuation to execute correctly.
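For reference, a minimal sketch of the current order (assuming texthero's default clean pipeline, which among other steps removes punctuation and stopwords):

```python
import pandas as pd
import texthero as hero

s = pd.Series(["Hi!! How are you?", "你好！你好吗？"])

# Current order: clean first, tokenize afterwards.
# By the time tokenize runs, punctuation (and, with the default pipeline,
# stopwords) is already gone, so a tokenizer that relies on punctuation
# cues, e.g. for non-Western languages, can no longer use them.
s = hero.clean(s)
s = hero.tokenize(s)
```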
The natural question is: wouldn't it be better to have tokenize as the very first operation?

In this scenario, all preprocessing functions would receive a TokenSeries as input. As we care about performance, one question is whether we can develop a remove_punctuation that is efficient enough on a TokenSeries. The current version of the tokenize function is quite efficient as it makes use of regex. The first task would be to develop the new variant and benchmark it against the current one. An advantage of the non-regex approach is that, since the input is a list of lists, we might be able to parallelize the work.
Could we move tokenize to the very first step while keeping performance high? Which solution offers the fastest performance?
The other question is: is there a scenario where preprocessing functions should deal with a TextSeries rather than a TokenSeries?
Extra crunch:
The current tokenize version uses a very naive regex-based approach that works only for Western languages. The main advantage is that it's quite fast compared to NLTK or other solutions. An alternative we should seriously consider is to replace the regex version with the spaCy tokenizer (#131). The question is: how can we tokenize with spaCy in a very efficient fashion?
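One possible direction, sketched under the assumption that we only need spaCy's tokenizer and not its full pipeline: create a blank language object and stream the texts through nlp.pipe, which batches documents and (since spaCy 2.2) can spread the work over several processes. The function name below is illustrative, not part of the current API.

```python
import pandas as pd
import spacy

def tokenize_with_spacy(s: pd.Series, lang: str = "en", n_process: int = 1) -> pd.Series:
    # spacy.blank() loads only the language data and the tokenizer,
    # so no tagger/parser slows the pipeline down.
    nlp = spacy.blank(lang)
    # nlp.pipe streams the texts in batches; n_process > 1 enables
    # multiprocessing (available since spaCy 2.2).
    docs = nlp.pipe(s.astype(str).tolist(), n_process=n_process)
    return pd.Series(
        [[token.text for token in doc] for doc in docs], index=s.index
    )

s = pd.Series(["Hi!! How are you?", "I'm fine, thanks."])
print(tokenize_with_spacy(s))
```

Whether this beats the regex version would have to come out of the same benchmark as above.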
To keep in mind: one of the advantages of the "clean" section of pre-processing is the possibility to clean small strings (e.g. names, addresses, etc.) in a dataset in a uniform way. This benefit, although small compared to the overall benefits of using the whole pipeline on big chunks of text, could be an interesting pre-step to string matching operations. Those are very common in some research contexts, where you have to merge different datasets based, for instance, on Company Name or Scientific Publication authors. Would moving "tokenize" earlier in the pipeline prevent this use of Texthero?
Very interesting observation.
For the case you mentioned, we can tokenize, clean (probably with a custom pipeline and normalization) and then join the tokens back. Do you see any drawbacks?
>>> import pandas as pd
>>> import texthero as hero
>>> s = pd.Series(["Madrid", "madrid, the", "Madrid!"])
>>> s = hero.tokenize(s)
>>> s = hero.clean(s)
>>> s.str.join(" ")
0    madrid
1    madrid
2    madrid
dtype: object
(out of the discussion) -> sooner or later we will have to think about how to add a universal hero.merge / hero.join function to merge DataFrames with string columns (Pandas merge works only on perfectly equal strings). A (naive) approach might be to tokenize (probably at the sub-level, still to be implemented), compute embeddings (section 4 of #85 with flair), and merge cells that share very similar vectors (somehow related to #45).
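Purely as an illustration of that idea (hero.merge does not exist yet; the name fuzzy_merge, the embed callable and the threshold below are all hypothetical), the flow could be: embed the string column of both frames, compute cosine similarities, and keep the pairs above a threshold.

```python
import numpy as np
import pandas as pd

def fuzzy_merge(left: pd.DataFrame, right: pd.DataFrame, on: str,
                embed, threshold: float = 0.9) -> pd.DataFrame:
    # `embed` is any callable mapping a Series of strings to a 2-D numpy
    # array with one row vector per string (e.g. built on top of the
    # flair-based representation from section 4 of #85).
    lv, rv = embed(left[on]), embed(right[on])
    # Normalize rows so that the dot product equals the cosine similarity.
    lv = lv / np.linalg.norm(lv, axis=1, keepdims=True)
    rv = rv / np.linalg.norm(rv, axis=1, keepdims=True)
    sim = lv @ rv.T                      # shape: (len(left), len(right))
    li, ri = np.where(sim > threshold)   # index pairs of matching rows
    return pd.concat(
        [
            left.iloc[li].reset_index(drop=True),
            right.iloc[ri].reset_index(drop=True).add_suffix("_right"),
        ],
        axis=1,
    )
```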
@henrifroese, would you mind helping us with that, since you are already familiar with the Series subject?
@jbesomi which part do you need help with? Or in general?
I think that overall, as described in #131, the spaCy version without parallelization is too slow to be useful for texthero. With spaCy's own parallelization it's still a lot slower than the regex version, but usable, and with the parallelization from #162 it's pretty fast and usable.
However, I'm not 100% convinced we should always tokenize first. I think the point mentioned by @Iota87 is correct: there are users who mainly use the cleaning functions etc., and it's a little annoying and counterintuitive having to tokenize, clean, and then join again.
Additionally, this would of course be a pretty big development effort, as it requires changing a lot of functionality in the preprocessing module and the tests, so I want to make sure this really is necessary 🥵