finnbear/rustrict

UTF-8 added words not being detected


The following code does not work (the added Cyrillic word is not detected):

use rustrict::{CensorStr, Type};
#[cfg(feature = "customize")]
use rustrict::add_word;

fn main() {
    #[cfg(feature = "customize")]
    {
        unsafe {
            add_word("плохоеслово", Type::PROFANE & Type::SEVERE);
        }
    }

    let inappropriate = "hello плохоеслово".is_inappropriate();
    println!("{}", inappropriate); // false
}

The same code with ASCII characters does work:

use rustrict::{CensorStr, Type};
#[cfg(feature = "customize")]
use rustrict::add_word;

fn main() {
    #[cfg(feature = "customize")]
    {
        unsafe {
            add_word("badword", Type::PROFANE & Type::SEVERE);
        }
    }

    let inappropriate = "hello badword".is_inappropriate();
    println!("{}", inappropriate); // true
}

Also, is there a way to add new words in bulk? Or maybe a way to extend the default wordlist?

Context

I am using the latest rustrict version (0.5.10).

Thanks for the issue! I've changed the character replacement strategy to allow matching certain non-ASCII characters. Your example now works in version 0.5.11.

> Also, is there a way to add new words in bulk?

You can call add_word as many times as you want. #6 did ask for a more ergonomic API, and it is something I'm considering :)
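For example, here is a minimal sketch of adding a whole list at startup (the helper function and word list are hypothetical, not part of rustrict's API):

use rustrict::CensorStr;

#[cfg(feature = "customize")]
fn add_custom_words(words: &[&str]) {
    use rustrict::{add_word, Type};
    for word in words {
        // Safety: per add_word's documentation, this must not be called
        // while the filter is in use elsewhere, so do it once at startup.
        unsafe {
            add_word(word, Type::PROFANE & Type::SEVERE);
        }
    }
}

fn main() {
    #[cfg(feature = "customize")]
    add_custom_words(&["badword1", "badword2", "badword3"]);

    // Prints true once the words are added (requires the "customize" feature).
    println!("{}", "hello badword1".is_inappropriate());
}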

Thanks, I can confirm that everything works fine now in 0.5.11.

About add_word: will there be any performance impact if I call it, let's say, 10 thousand times?
I just want to use rustrict in my Telegram chat bot as a profanity filter module, and I need an easy way to extend the default dictionary.

> About add_word: will there be any performance impact if I call it, let's say, 10 thousand times?

There are two parts to the performance impact:

  • Calling add_word that many times will add a small, possibly negligible delay at startup (a rough way to measure it is sketched after this list).
  • Future filter performance will be slightly worse (I'm guessing around 25% slower), because using more memory means a smaller proportion will fit in the CPU cache. It will not be 10,000 times slower than if there were only one word (the filter never iterates over all words in the wordlist).
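If you want to gauge that startup delay yourself, here is a rough sketch using only the standard library (the 10,000 synthetic words are generated purely for measurement):

#[cfg(feature = "customize")]
fn main() {
    use rustrict::{add_word, Type};
    use std::time::Instant;

    // Generate 10,000 synthetic words to exercise add_word.
    let words: Vec<String> = (0..10_000).map(|i| format!("badword{}", i)).collect();

    let start = Instant::now();
    for word in &words {
        // Safety: nothing else is using the filter yet.
        unsafe {
            add_word(word, Type::PROFANE & Type::SEVERE);
        }
    }
    println!("added {} words in {:?}", words.len(), start.elapsed());
}

#[cfg(not(feature = "customize"))]
fn main() {}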