Incorrect processing result for keywords containing symbols
brandonbai opened this issue · 4 comments
Running `profanity.censor` with the default config on words such as `"s&m"`, `"s & m"`, or `"2 girls 1 cups"` gives an incorrect result.

For example:

```python
print(profanity.censor("s & m"))
# s & m
```

Why?
Thank you for reporting the issue.
I'm in the first stage of troubleshooting the problem. It seems to be caused by the function `update_next_words_indices`, which returns a wrong list of next words to be parsed.
I will keep this issue updated when I have any new findings.
From my side, `2 girls 1 cups` returns the correct result.

The `s & m` case seems to be caused by `update_next_words_indices`, which doesn't create the expected list of words, due to the `&` character.
Take `hello 123` as an example:

- How the library works: when a word is identified (`hello`), it checks whether any continuous combination of it and the following word(s) forms a swear word in the wordlist.
- What `update_next_words_indices` does is return a list of the following words, starting from the current one found. So in this sample it returns the list `['123', ' 123']`.
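The lookahead described above can be sketched roughly like this (a minimal illustration based on this thread; the function name and output shape are assumptions, not the library's actual code):

```python
def following_word_variants(words, i):
    """For the word at index i, list each following word twice: bare
    ('123') and with its leading separator kept (' 123'), mirroring
    the ['123', ' 123'] example above.
    """
    variants = []
    for word in words[i + 1:]:
        variants.extend([word, " " + word])
    return variants

print(following_word_variants(["hello", "123"], 0))  # ['123', ' 123']
```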
However, for `s & m`, the `&` character is treated as a separate value (just as `,` and similar punctuation are), instead of being grouped into the list of following words returned by `update_next_words_indices`.
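A hypothetical illustration of the tokenization described above (not the library's actual code): because `&` comes out as its own token, it never ends up in the list of following words for `s`.

```python
import re

def split_words(text):
    """Split text into word tokens and standalone punctuation tokens,
    so '&' becomes its own token rather than part of a word."""
    return re.findall(r"\w+|[^\w\s]", text)

print(split_words("hello 123"))  # ['hello', '123']
print(split_words("s & m"))      # ['s', '&', 'm']
```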
As I'm very busy with my studies at the moment, I won't be able to fix this bug anytime soon, likely not within the next month or so.
Please feel free to create a PR for this.
This is considered a major development for the library, which I won't be able to take on in the near future due to a tight schedule as a final-year student.
A suggestion on how to fix this: create a separate wordlist for "special" words, i.e. ones whose separators differ from a single space `' '` and require an exact match on the separator(s) (such as `s & m`). While parsing the text, if the current word and the next word(s) match an entry in the special wordlist, return `True` only if the separator is also identical; otherwise, return `False`.
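The suggested fix could be sketched like this (all names here are hypothetical, not the library's real API): special entries are stored token by token, so the separator has to match exactly.

```python
# Assumed sample entry for the special wordlist described above;
# the separator '&' is kept as its own token.
SPECIAL_WORDLIST = {("s", "&", "m")}

def matches_special(tokens, start):
    """Return True if the tokens beginning at `start` exactly match an
    entry in the special wordlist, separator tokens included."""
    for entry in SPECIAL_WORDLIST:
        if tuple(tokens[start:start + len(entry)]) == entry:
            return True
    return False

print(matches_special(["say", "s", "&", "m"], 1))  # True: '&' matches exactly
print(matches_special(["s", "and", "m"], 0))       # False: separator differs
```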
Can't you just run the check on the text first, and then, if there is no detection, use a regex to remove duplicates and try again?
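One possible reading of this suggestion (an interpretation, untested against the library; `contains_profanity` and `censor` below are stand-ins for the library's checks): collapse runs of repeated characters and re-check only when the first pass finds nothing.

```python
import re

def collapse_repeats(text):
    """Collapse runs of the same character to one occurrence,
    e.g. 'shiiit' -> 'shit'. Note this can also mangle legitimate
    words ('hello' -> 'helo'), so it is only used on the retry path."""
    return re.sub(r"(.)\1+", r"\1", text)

def censor_with_retry(text, contains_profanity, censor):
    """First pass on the raw text; if nothing is detected, retry on a
    normalized copy."""
    if contains_profanity(text):
        return censor(text)
    normalized = collapse_repeats(text)
    if contains_profanity(normalized):
        return censor(normalized)
    return text
```

Whether returning the normalized text (rather than re-mapping matches onto the original string) is acceptable would depend on how the library is expected to behave.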