snguyenthanh/better_profanity

Incorrect processing result for keywords having symbols

brandonbai opened this issue · 4 comments

Running profanity.censor with the default config on words such as "s&m", "s & m", "2 girls 1 cups" ... gives an incorrect result.
For example:

from better_profanity import profanity

print(profanity.censor("s & m"))
# s & m

Why?

Thank you for reporting the issue.

I'm in the first stage of troubleshooting the problem. It seems to be caused by the function update_next_words_indices, which returns a wrong list of next words to be parsed.

I will keep this issue updated when I have any new findings.

From my side, 2 girls 1 cups returns the correct result.

The s & m failure seems to be caused by update_next_words_indices, which doesn't create the expected list of words because of the & character.

Take hello 123 as an example:

  1. When a word is identified (hello), the library checks whether any continuous combination of it and the following word(s) forms a swear word in the wordlist.
  2. The function update_next_words_indices returns a list of the following words, starting from the current one found. In this example it returns the list ['123', ' 123'].

However, for s & m, the & character is treated as a separate value (just like , and ), instead of being grouped into the list of following words returned by update_next_words_indices.
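To make the look-ahead described above concrete, here is a minimal sketch. It is a simplified model, not the library's actual code: the names WORDLIST, split_words, following_candidates and contains_swear_phrase are hypothetical, and it assumes that only letters and digits count as word characters, so & is dropped as a separator.

```python
import re

# Hypothetical two-entry wordlist, for illustration only.
WORDLIST = {"hello 123", "s & m"}


def split_words(text):
    """Assumption: anything that is not a letter or digit separates words."""
    return re.findall(r"[A-Za-z0-9]+", text)


def following_candidates(words, start):
    """Model of the list described above: the following words joined
    without a separator and joined with single spaces, built up word by word."""
    candidates = []
    bare, spaced = "", ""
    for word in words[start:]:
        bare += word
        spaced += " " + word
        candidates.extend([bare, spaced])
    return candidates


def contains_swear_phrase(text):
    """Check each word alone and combined with each candidate tail."""
    words = split_words(text.lower())
    for i, word in enumerate(words):
        if word in WORDLIST:
            return True
        for tail in following_candidates(words, i + 1):
            if word + tail in WORDLIST:
                return True
    return False


print(following_candidates(split_words("hello 123"), 1))
print(contains_swear_phrase("hello 123"))
print(contains_swear_phrase("s & m"))
```

In this model, "s & m" only ever produces the combinations "sm" and "s m", so the wordlist entry "s & m" can never match, which is consistent with the behaviour reported in this issue.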

As I'm very busy with my studies at the moment, I won't be able to fix this bug for about a month.
Please feel free to create a PR for this.

This is considered a major development for the library, which I won't be able to take on in the near future because of my tight schedule as a final-year student.

One suggestion for a fix is to create a separate wordlist for special words: ones whose separators are something other than a single space ' ' and whose separator(s) must match exactly (such as s & m).
While parsing the text, if the current word and the next word(s) match a set of words in the special wordlist, return True if the separators are also identical; otherwise, return False.
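The suggestion above could look roughly like the following sketch. Everything here is an assumption for illustration, not part of better_profanity: the names SEPARATOR_RE, SPECIAL_WORDLIST, split_with_separators and contains_special_word are hypothetical, and the two special entries are just the examples from this issue.

```python
import re

# Assumption: any run of characters that are not letters or digits is a separator.
SEPARATOR_RE = re.compile(r"([^A-Za-z0-9]+)")


def split_with_separators(text):
    """Return (words, separators); separators[i] sits between
    words[i] and words[i + 1]."""
    parts = SEPARATOR_RE.split(text.lower())
    return parts[0::2], parts[1::2]


# Hypothetical special wordlist, stored as (words, separators) pairs.
SPECIAL_WORDLIST = [split_with_separators(p) for p in ("s & m", "s&m")]


def contains_special_word(text):
    """True only if both the words and the separators between them match."""
    words, seps = split_with_separators(text)
    for i in range(len(words)):
        for special_words, special_seps in SPECIAL_WORDLIST:
            n = len(special_words)
            if (words[i:i + n] == special_words
                    and seps[i:i + n - 1] == special_seps):
                return True
    return False


print(contains_special_word("s & m"))
print(contains_special_word("we like s&m here"))
print(contains_special_word("s, m"))
```

Storing the separators next to the words keeps "s & m" distinct from "s, m" or "s m", at the cost of needing one entry per separator variant.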

Can't you just run the check on the text first and then, if nothing is detected, use a regex to remove duplicates and try again?