A proc macro regex library to match an arbitrary string or byte array to a regular expression.
Add this to your Cargo.toml
:
[dependencies]
proc-macro-regex = "~1.1.0"
The macro regex!
creates a function of the given name which takes a string or byte array and
returns true
if the argument matches the regex, otherwise false
.
use proc_macro_regex::regex;
/// Create the function with the signature:
/// fn regex_email(s: &str) -> bool;
regex!(regex_email "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$");
fn main () {
println!("Returns true == {}", regex_email("example@example.org"));
println!("Returns false == {}", regex_email("example.example.org"));
}
The given regex works the same as in the regex crate. If the ^
is at the beginning of the regex and $
at the end then the whole string is checked, otherwise is
check if the string contains the regex.
The macro creates a deterministic finite automaton (DFA), which parse the given input. Depending on the size of the DFA or the character of the regex, a lookup table or a code base implementation (binary search) is generated. If the size of the lookup table would be bigger than 65536 bytes (can be changed) then a code base implementation (binary search) is used. Additionally, if the regex contains any Unicode (no ASCII) character then a code base implementation (binary search) is used, too.
The following macro generates the following code:
regex!(example_1 "abc");
Generates:
fn example_1(s: &str) -> bool {
static TABLE: [[u8; 256]; 3usize] = [ ... ];
let mut state = 0;
for c in s.bytes() {
state = TABLE[state as usize][c as usize];
if state == u8::MAX {
return true;
}
}
false
}
To tell the macro that the lookup table is not allowed to be bigger than 256 bytes, a third argument can be given. Therefore, a code base implementation (binary search) of the DFA is generated.
regex!(example_2 "abc" 256);
Generates:
fn example_2(s: &str) -> bool {
let mut state = 0;
for c in s.bytes() {
state = if state < 1usize {
match c {
97u8 => 1usize,
_ => 0usize,
}
} else {
if state == 1usize {
match c {
97u8 => 1usize,
98u8 => 2usize,
_ => 0usize,
}
} else {
match c {
97u8 => 1usize,
99u8 => return true,
_ => 0usize,
}
}
};
}
false
}
To change the visibility of the function, add the keywords at the beginning of the arguments.
regex!(pub example_2 "abc" 256);
Generates:
pub fn example_3(s: &str) -> bool {
// same as in example_1 (see above)
}
To parse a byte array instead of string, pass a byte string.
regex!(example_4 b"abc");
Generates:
fn example_4(s: &[u8]) -> bool {
// same as in example_1 (see above)
}
The generated code should work with #![no_std]
, too.
Advantages:
- Compile-time (no runtime initialization, no lazy-static)
- Generated code that does not contain any dependencies
- No heap allocation
- Approximately 12%-68% faster for no trivia regex 1
Disadvantages:
- Currently, no group captures
- No runtime regex generation
This is the performance comparison between this crate and the regex crate. If you want to test it
by yourself, run cargo bench --bench compare
.
Name | proc-macro-regex |
regex |
Result |
---|---|---|---|
743.95 MiB/s | 441.67 MiB/s | 68.44 % | |
URL | 584.62 MiB/s | 519.00 MiB/s | 12.64 % |
IPv6 | 746.92 MiB/s | 473.38 MiB/s | 57.78 % |
This was compiled with rustc 1.53.0-nightly (392ba2ba1 2021-04-17)
.
This project is licensed under the BSD-3-Clause license.
Any contribution intentionally submitted for inclusion in proc-macro-regex
by you, shall
be licensed as BSD-3-Clause, without any additional
terms or conditions.
Footnotes
-
It were tested with regex in
benches/compare.rs
. For pattern/word matching it is slower because the regex library uses aho-corasick. (See Performance) ↩