kennytm/rand_regex

Support for recursive regex

LoZack19 opened this issue · 2 comments

I need to generate a tree in json. A tree is an inherently recursive structure, so I wanted to use recursive regex groups to achieve this. However, no text generation library using regexes in any ecosystem seems to support this feature. I realize this is a lot to ask, but it would be really very nice to be able to do this. Right now the only solution is to recursively generate a non-recursive regex, but the compile times for large regexes are biblical.

What I want

<node> = {"value":<value>,"children":[<node>,...,<node>]}

How I want to do be able to do it

let regex = Regex::compile(
    r#"(?>{"value":"\w*","children":\[(?>(?>(?R),)*(?R))?\]})"#,
    5,
)
.unwrap();
let mut rng = rand::thread_rng();
let rand_tree = rng.sample::<String, _>(&regex);
println!("{rand_tree}");

How I must do in order to achieve this at the moment

fn json_node(depth: usize, value_repr: &str) -> String {
    if depth == 0 {
        format!("\\{{\"value\":({value_repr}),\"children\":\\[\\]\\}}")
    } else {
        let node = json_node(depth - 1, value_repr);
        format!(
            "\\{{\"value\":({value_repr}),\"children\":\\[((({node}),)*({node}))?\\]\\}}",
        )
    }
}

fn sample_json_random_tree(depth: usize) -> String {
    const VALUE: &str = "\"\\w+\"";
    let json_tree_gen = LazyCell::new(|| Regex::compile(&json_node(depth, VALUE), 5).unwrap());

    rand::thread_rng().sample::<String, _>((*json_tree_gen).borrow())
}

I think randomly generating the structure struct Node { value: String, children: Vec<Node> } and then serialize it to JSON makes more sense for your use case than condense the entire procedure into a cryptic PCRE.

Supporting (?R) makes it impossible to simply reuse regex-syntax, and even fancy-regex does not mention it, so it's pretty much out-of-scope.

I have already written code to generate a random tree, but that is not interesting, because it will always generate a valid tree. I wanted do perform some smart fuzzing to test the solidity of my deserialize implementation. I understand that this is too difficult of a problem to tackle, though. But it would be nice to have a tool in general for doing such things in the future.