snguyenthanh/better_profanity

Version 0.7.0 significantly slower than 0.6.1

virtustate opened this issue · 10 comments

We've been using this great module in a larger analysis for months: thank you for making it available! Here's the relevant code:

from better_profanity import profanity

corpus = reviews['review_text']
profanity.add_censor_words([x.lower() for x in other_stop_words])
corpus = corpus.apply(profanity.censor, args=(' ',))

The statement corpus.apply(profanity.censor, args=(' ',)) takes a couple of orders of magnitude longer with version 0.7.0 than with 0.6.1. Here are some timings with everything the same except the better_profanity version; "Time to apply profanity" measures just corpus = corpus.apply(profanity.censor, args=(' ',)).

better_profanity=0.7.0

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 95.60132622718811
Time it takes to run this LDA model: 97.7805495262146
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 29.865559816360474
Time it takes to run this LDA model: 30.556731939315796
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 710.3287241458893
Time it takes to run this LDA model: 726.3966491222382
{'pocket', 'comfortable', 'feel', 'waist', 'fit', 'see', 'color', 'soft'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 10.014711618423462
Time it takes to run this LDA model: 10.347350835800171
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 121.31187510490417
Time it takes to run this LDA model: 123.6646056175232
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

better_profanity=0.6.1

This product is named Oasis High-Waisted Pocket Capri
Begin by noting there are 1669 reviews for this product
Time to apply profanity: 0.19291996955871582
Time it takes to run this LDA model: 4.058649063110352
{'size', 'material', 'working', 'comfortable', 'waist', 'length', 'fit', 'fabric', 'soft'}
1 of 122 products' review sets remodeled

This product is named Ryan Built In Bra Tank II
Begin by noting there are 427 reviews for this product
Time to apply profanity: 0.05718731880187988
Time it takes to run this LDA model: 0.7385601997375488
{'cute', 'size', 'top', 'comfortable', 'fit'}
2 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Pocket 7/8
Begin by noting there are 10934 reviews for this product
Time to apply profanity: 1.264852523803711
Time it takes to run this LDA model: 17.08655619621277
{'pocket', 'size', 'comfortable', 'waist', 'fit', 'around', 'color', 'amazing'}
3 of 122 products' review sets remodeled

This product is named High-Waisted Ultracool Side Stripe Crop
Begin by noting there are 168 reviews for this product
Time to apply profanity: 0.018624067306518555
Time it takes to run this LDA model: 0.34430885314941406
{'comfortable', 'feel', 'waist', 'size'}
4 of 122 products' review sets remodeled

This product is named Oasis High-Waisted Twist 7/8
Begin by noting there are 1750 reviews for this product
Time to apply profanity: 0.2005002498626709
Time it takes to run this LDA model: 2.5792129039764404
{'cute', 'style', 'size', 'material', 'comfortable', 'detail', 'bit', 'fit', 'color', 'soft'}
5 of 122 products' review sets remodeled

Hi @virtustate, thank you for using the library and reporting the issue.

Could you please share some of the texts and censor_words that the .censor function runs on?

import pandas
import better_profanity
from better_profanity import profanity
import time

df = pandas.read_csv('datasetSentences.txt', sep='\t')
df = df[1:100]
start = time.time()
corpus = df['sentence']
corpus = corpus.apply(better_profanity.profanity.censor, args=(' ',))
print(time.time() - start, better_profanity.__version__)

[Attachment: datasetSplit.txt]

I'm getting about 30s with 0.7.0 versus 0.04s with 0.6.1 when run in the debugger, and 4s with 0.7.0 versus 0.013s with 0.6.1 in a normal run.

Interesting. 0.7.0 reduces the extra memory consumption for the word variants, but somehow increases the runtime.
I will look into it by the weekend and get back to you.

I suggest using version 0.6.1 in the meantime, though the memory usage could be a bit high.

Thanks, no hurry. 0.6.1 is working well for us without any memory issues.

For some context, we made some fairly substantial changes to the underlying data structure used to compare strings. The changes and their justification are described in pull request #17.

The benchmarking I did between 0.6.1 and 0.7.0 only used the unit tests, and I found 0.7.0 to be substantially faster. I assumed their overall runtime was representative of most use cases. That assumption was incorrect: evidently, the unit tests were only faster because we cut word-list load times so drastically (we were originally having issues with exponential memory usage on large word lists, hence the changes to pre-processing). I've opened issue #20 to add unit tests that will help us catch slowdowns like this in the future.

A quick-and-dirty fix could be to just use the original algorithm when a word list is sufficiently small, and switch to the new memory-efficient algorithm only when necessary. In the meantime, I'm going to explore faster solutions for the comparison of strings with character variants.
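To illustrate the idea (a hypothetical sketch, not the library's actual code; the threshold and both build functions are made up for illustration):

SMALL_LIST_THRESHOLD = 10_000  # illustrative cutoff, not a tuned value

def build_variant_set(words):   # 0.6.1-style: expand all variants up front
    ...                         # O(1) lookups, O(w * 4^N) memory

def build_variant_trie(words):  # 0.7.0-style: store words compactly
    ...                         # O(n) lookups, O(w * N) memory

def build_filter(words):
    # Dispatch on word-list size: fast lookups when memory is cheap,
    # compact storage only when the list is large enough to need it.
    if len(words) <= SMALL_LIST_THRESHOLD:
        return build_variant_set(words)
    return build_variant_trie(words)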

A slice of this dataset should be good for benchmarking on our end. It contains 278677 user reviews for clothing products. Quick preliminary tests on a slice of 1000 reviews show that 0.7.0 is roughly 5x slower than 0.6.1 (excluding the load time for the word list).
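The preliminary test looked roughly like this (a sketch: the file and column names are assumptions about the dataset's layout, and the word list is loaded before the timer starts so its load time is excluded):

import time
import pandas as pd
from better_profanity import profanity

profanity.load_censor_words()  # load the default word list up front

# Hypothetical file/column names for the clothing-review dataset.
reviews = pd.read_csv('clothing_reviews.csv')['review_text'].head(1000)

start = time.time()
reviews.apply(profanity.censor, args=(' ',))
print(time.time() - start)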

Did some rough math and came up with big-O bounds for 0.6.1, 0.7.0, and the new algorithm I'm working on. Let n be the number of letters in a word being checked, w the number of words in the word list, and N the maximum length of a word in the word list.

Version         Memory        Lookup
0.6.1           O(w * 4^N)    O(1)
0.7.0           O(w * N)      O(n)
New algorithm   O(w * N)      O(n)

The word lookup in 0.6.1 was basically just a hash lookup in a standard Python set, so it's going to be hard to beat. That said, the lookup for the algorithm I'm working on should essentially be O(1) in practice, both for documents with few or no curse words and for documents whose curse words are relatively short. Regardless, there will be some extra overhead versus 0.6.1; whether that overhead is significant remains to be determined.
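As a rough illustration of where the O(w * 4^N) memory bound comes from (a simplified sketch, not the library's actual code or variant table), the 0.6.1-style approach expands every character-variant spelling of each word up front, so each membership check is a single O(1) set lookup:

from itertools import product

# Toy variant map; the real library supports more substitutions.
VARIANTS = {'a': 'a@', 'i': 'i1!', 'o': 'o0', 's': 's$5'}

def expand_variants(word):
    # Yield every spelling of `word` under the variant map:
    # up to 4^N combinations for a word of length N.
    choices = [VARIANTS.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield ''.join(combo)

censor_set = {v for w in ('bad', 'word') for v in expand_variants(w)}
print('b@d' in censor_set)  # True: lookup is one hash probe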

As a side note, it looks like we can get marginal speedups for dictionary look-ups (a reduction of about 17%) by interning strings, as demonstrated here.
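A minimal sketch of that trick, assuming tokens are interned once before repeated dictionary lookups (the ~17% figure comes from the linked benchmark, not from this snippet):

import sys

# Intern the keys once at load time...
censor_map = {sys.intern(w): True for w in ('badword1', 'badword2')}

def is_censored(token):
    # ...and intern each query, so the dict lookup can succeed on a
    # fast pointer-identity check instead of a full string comparison.
    return sys.intern(token) in censor_map

print(is_censored('badword1'))  # True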

Seeing some promising results with a new algorithm, which you can see in this branch. For benchmarking, I generated lorem ipsum paragraphs with 5% of the words replaced with profanity. The filters all use the same default word list. I then ran each relevant version's censor against different sets of paragraphs and timed it. Here are the runtimes in seconds, averaged over 4 runs:

Version         10 paragraphs   100 paragraphs   1000 paragraphs
0.6.1           0.46            4.82             48.43
0.7.0           2.02            20.72            212.85
New algorithm   0.01            0.10             1.01
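For context, the benchmark inputs were generated roughly like this (a sketch: the filler words, profanity samples, and paragraph length are stand-ins; only the 5% replacement rate and the default word list match the setup described above):

import random
import time
from better_profanity import profanity

random.seed(0)
FILLER = 'lorem ipsum dolor sit amet consectetur adipiscing elit'.split()
PROFANITY = ['shit', 'damn']  # samples from the default word list

def make_paragraph(n_words=100, profane_rate=0.05):
    # Replace roughly 5% of the filler words with profanity.
    return ' '.join(
        random.choice(PROFANITY) if random.random() < profane_rate
        else random.choice(FILLER)
        for _ in range(n_words)
    )

paragraphs = [make_paragraph() for _ in range(100)]
profanity.load_censor_words()  # same default word list for every run

start = time.time()
for p in paragraphs:
    profanity.censor(p)
print(f'{time.time() - start:.2f}s for 100 paragraphs')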

Now, I expected this new algorithm to be slower than 0.6.1, so these times are a bit surprising. The trials did include assertion checks against known censored versions of the text, and my algorithm passes all the unit tests. But given our past mistake of not testing 0.7.0 thoroughly enough, we should resolve #20 first so we have a more trustworthy test suite to verify against.