besok/jsonpath-rust

Regex bad performance

catalinstochita opened this issue · 5 comments

Hey. I'm using this library at work and I have noticed some slowness when using regex, looks like creating a Regex is somehow slow. I have tested the theory using a global map for storing the compiled regex and it improved the speed ~50x. The implementation is not what we want, we should create the regex when creating the jsonpath and use it directly. I'm not sure how we can do that.

Here is the changes I made for testing:
https://github.com/catalinstochita/jsonpath-rust

Let me know if you have an idea of how we can do that. Thank you @besok

Hi, sure. I will take a look shortly.

Yeah, the regular expressions are heavy and the general advise is to avoid recompiling them as much as possible.

Thus, I can't see any other way but to solve it with something like a cache with the precompiled expressions.

So, essentially, the library can only provide the option to accept the regex cache and use it.

struct JsPathRegexCache {
    cache: Arc<Mutex<HashMap<String, Regex>>>,
}
pub struct JsonPathConfig {
    pub with_regex_cache: bool,
    pub regex_cache: JsPathRegexCache,
}

fn example(global_cache:JsPathRegexCache) {
    let json = Box::new(json!({}));
    let cfg = JsonPathConfig::with_cache(global_cache)
    let _v = (cfg, json).path("$.[?(@.author ~= '.*(?i)d\\(Rees\\)')]")
        .expect("the path is correct");
}

Obviously, it will improve the performance only in the long run. If the regexes are always different it will not help.

I scribbled an example of how it may be in this branch

In the given benchmark the duration gets boosted 5 times to roughly 25 µs on my laptop.
Looking into this thread where eventually the duration was 1,5 seems more or less plausible.

Let me now, if this approach works and I will implement it shortly.

Yes, is working fine. It would have been nice if serde_json::Value had a Regex field, I think in that case we would just added a "regex" rule and parse it directly, but this method works too

Yeah, great. I will implement it in a week or two.

It would have been nice if serde_json::Value had a Regex field, I think in that case we would just added a "regex" rule and parse it directly, but this method works too

Well, it is beyond this repo unfortunately but even if not to me it would be not clear how it would solve the problem assuming:

  • one Value can have different regexes to be parsed with
  • one regex can be applied to the different Value instances
  • this will transfer the given problem of having a cache from regexes to having a cache of Values eventually which seems to be less static and harder to maintain

But maybe, I just missed something.

So, the story:
Add a configuration for jspath which will accept the cache for regexes.
In the future, the configuration can be enriched with the other parameters.

The cache should be mutable(assuming refcell) and capable of working in multithreading (assuming Arc)
the regex cache should be used when the library calculates the regex operator and either create a new one and place it in the cache or take the existing one from the cache.
if you want, feel free to take it on. :)
I will be able to come back to it in a week or two.