A string tokenizer library for Rust, in which the characters that separate tokens can also, conditionally, be emitted as tokens themselves.
There are filter implementations provided for a few basic use cases:
```rust
use tokesies::*;

let line = "hello!world, this is some_text";
let tokens = FilteredTokenizer::new(filters::DefaultFilter {}, line).collect::<Vec<Token>>();

// tokens: ["hello", "!", "world", ",", "this", "is", "some", "_", "text"]
assert_eq!(tokens.get(0).unwrap().term(), "hello");
```
You can alternatively provide a custom implementation:
```rust
use tokesies::*;

pub struct MyFilter;

impl filters::Filter for MyFilter {
    // Returns (is_separator, keep_separator_as_token), as the two
    // arms below illustrate.
    fn on_char(&self, c: &char) -> (bool, bool) {
        match *c {
            ' ' => (true, false), // split on spaces and discard them
            ',' => (true, true),  // split on commas and keep them as tokens
            _ => (false, false),  // everything else is part of a token
        }
    }
}

let line = "hello!world, this is some_text";
let tokens = FilteredTokenizer::new(MyFilter {}, line).collect::<Vec<Token>>();

// tokens: ["hello!world", ",", "this", "is", "some_text"]
assert_eq!(tokens.get(0).unwrap().term(), "hello!world");
```
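The core idea behind a filtered tokenizer can be sketched without the crate itself. The following is a minimal, self-contained approximation using only the standard library; `tokenize` and `Rule` are hypothetical names for illustration and are not part of the tokesies API:

```rust
// A separator classification, mirroring the (bool, bool) pairs above:
// Normal ~ (false, _), Drop ~ (true, false), Keep ~ (true, true).
#[derive(Clone, Copy)]
enum Rule {
    Normal, // char is part of the current token
    Drop,   // char ends the current token and is discarded
    Keep,   // char ends the current token and becomes a token itself
}

fn tokenize(input: &str, classify: impl Fn(char) -> Rule) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in input.chars() {
        match classify(c) {
            Rule::Normal => current.push(c),
            sep => {
                // Flush any token accumulated so far.
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
                // Conditionally emit the separator as its own token.
                if let Rule::Keep = sep {
                    tokens.push(c.to_string());
                }
            }
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

fn main() {
    let tokens = tokenize("hello!world, this is some_text", |c| match c {
        ' ' => Rule::Drop,
        ',' => Rule::Keep,
        _ => Rule::Normal,
    });
    // Same result as the MyFilter example above.
    assert_eq!(tokens, vec!["hello!world", ",", "this", "is", "some_text"]);
    println!("{:?}", tokens);
}
```

The real crate wraps this logic in an iterator (`FilteredTokenizer`) and dispatches classification through the `filters::Filter` trait rather than a closure.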
The implementation is derived largely from this blog post by @daschl.
Contributions are very welcome, just fork and submit a pull request.