UTF-8 added words not being detected
Closed this issue · 3 comments
The following code does not work as expected:

```rust
use rustrict::{add_word, CensorStr, Type};

fn main() {
    #[cfg(feature = "customize")]
    {
        unsafe {
            // "плохоеслово" is Russian for "bad word"
            add_word("плохоеслово", Type::PROFANE & Type::SEVERE);
        }
    }
    let inappropriate = "hello плохоеслово".is_inappropriate();
    println!("{}", inappropriate); // false
}
```
The same code with ASCII characters does work:

```rust
use rustrict::{add_word, CensorStr, Type};

fn main() {
    #[cfg(feature = "customize")]
    {
        unsafe {
            add_word("badword", Type::PROFANE & Type::SEVERE);
        }
    }
    let inappropriate = "hello badword".is_inappropriate();
    println!("{}", inappropriate); // true
}
```
Also, is there a way to add many new words at once? Or maybe somehow extend the default word list?
Context: I am using the latest rustrict version (0.5.10).
Thanks for the issue! I've changed the character replacement strategy to allow matching certain non-ASCII characters. Your example now works in version 0.5.11.
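As an aside, the `#[cfg(feature = "customize")]` guard in the examples above implies the `customize` feature is enabled for the dependency. A minimal `Cargo.toml` entry might look like this (version number taken from this thread; adjust as needed):

```toml
[dependencies]
rustrict = { version = "0.5.11", features = ["customize"] }
```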
> Also, is there a way to massively add new words?

You can call add_word as many times as you want. #6 did ask for a more ergonomic API, and it is something I'm considering :)
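For bulk additions, one option is simply to loop over a word list once at startup. A minimal sketch, assuming the `customize` feature is enabled — the function name, word list, and chosen severity here are all placeholders, not part of rustrict's API:

```rust
#[cfg(feature = "customize")]
fn load_custom_words() {
    use rustrict::{add_word, Type};

    // Hypothetical custom word list; in practice this could be read
    // from a file or database at startup.
    const CUSTOM_WORDS: &[&str] = &["badword1", "badword2"];

    // add_word is unsafe because it must not run concurrently with
    // filtering, so call this once, early in main, before any
    // filtering happens.
    unsafe {
        for word in CUSTOM_WORDS {
            add_word(word, Type::PROFANE & Type::SEVERE);
        }
    }
}
```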
Thanks, I can confirm that everything works fine now in 0.5.11.
About add_word: will there be any performance impact if I call it, let's say, 10 thousand times? I just want to use rustrict in my Telegram chat bot as a profanity filter module, and I need some kind of easy-to-use way of extending the default dictionary.
> About add_word: will there be any performance impact if I call it, let's say, 10 thousand times?
There are two parts to the performance impact:
- Calling add_word that many times will add a small, possibly negligible delay at startup.
- Future filter performance will be slightly worse (I'm guessing around 25% slower), because using more memory means a smaller proportion will fit in the CPU cache. It will not be 10,000 times slower than if there was only 1 word (the filter never iterates all words in the wordlist).