java-json-tools/json-schema-core

Find a way to analyze regex besides Rhino

mattbishop opened this issue · 6 comments

I understand there are differences between the regex in JS and in Java, but it would be good to figure out if they are meaningful in light of json-schema. Loading up a Rhino context to use simple patterns is, well, heavy.

fge commented

There are quite a lot of differences between the regex dialects themselves. Plus, there is really only one context loaded for the entire duration of the program: the one which holds the necessary functions.

What is more, some commonly used regex idioms do not operate the same at all: witness \w, which only recognizes ASCII letters in Java, but all Unicode letters in JavaScript.
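The Java half of that comparison can be checked directly. The sketch below is only an illustration (the class and method names are made up for this example); it shows that `\w` in `java.util.regex` rejects a non-ASCII letter by default and accepts it once `Pattern.UNICODE_CHARACTER_CLASS` is set. It does not run a JavaScript engine, so the ECMA 262 side is not verified here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of json-schema-core.
final class WordClassDemo {
    // \w with default flags: ASCII letters, digits and underscore only.
    static boolean matchesDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    // \w with UNICODE_CHARACTER_CLASS (Java 7+): Unicode-aware word characters.
    static boolean matchesUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // a single non-ASCII letter
        System.out.println("default \\w on 'a': " + matchesDefault("a"));
        System.out.println("default \\w on U+00E9: " + matchesDefault(eAcute));
        System.out.println("unicode \\w on U+00E9: " + matchesUnicode(eAcute));
    }
}
```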

Prior to using Rhino, the code was using JDK's builtin ScriptEngine, which is both heavier and slower. I didn't really have a choice here in order to be compliant ;)

mattbishop commented

I'm looking at using the json-schema validator in a highly-threaded, performance-sensitive environment. I may need to work on this bit because pattern validation is important for us.

Here is what I found to compare ECMA and Java:

http://www.regular-expressions.info/refflavors.html

I also want to look carefully at exactly how patterns are defined in the json-schema spec. In my experience with XML Schema, a document spec usually narrows the requirements of patterns to facilitate cross-platform implementations:

http://json-schema.org/latest/json-schema-validation.html#3.3

fge commented

OK, to be fully honest, I did run performance tests on the current code, but I did not compare it against an implementation using java.util.regex, which I suspect would be faster. But correctness is my first goal ;)

Keywords using regexes in JSON Schema are pattern and patternProperties. There is a way to completely bypass Rhino for pattern: write a customized syntax checker and keyword validator using java.util.regex, and register them using a custom ValidationConfiguration. For patternProperties, it is unfortunately not possible unless you preload your own version of ObjectSchemaSelector.
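As a rough illustration of what the java.util.regex side of such a replacement could look like, here is a minimal sketch. It deliberately does not use the library's syntax checker or keyword validator interfaces, and wiring it into a ValidationConfiguration is not shown; `JavaRegexPatternCheck` and its methods are hypothetical names. It assumes Java's regex dialect is acceptable for the schemas in question, and it uses `find()` because JSON Schema patterns are not implicitly anchored.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of json-schema-core or json-schema-validator.
final class JavaRegexPatternCheck {
    // Compiled Pattern instances are immutable and safe to share across threads.
    private static final ConcurrentMap<String, Pattern> CACHE =
        new ConcurrentHashMap<>();

    // Syntax check: is the string a valid regex in Java's dialect?
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Keyword check: does the regex match anywhere in the input?
    // Throws PatternSyntaxException if the regex is invalid.
    static boolean matches(String regex, String input) {
        Pattern pattern = CACHE.get(regex);
        if (pattern == null) {
            Pattern compiled = Pattern.compile(regex);
            Pattern previous = CACHE.putIfAbsent(regex, compiled);
            pattern = (previous != null) ? previous : compiled;
        }
        // A Matcher is not thread-safe, so create a new one per call.
        return pattern.matcher(input).find();
    }
}
```

Caching the compiled patterns avoids recompiling on every validation, which matters most in the multi-threaded setup described earlier in the thread; the cache here is unbounded, which is a simplification.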

fge commented

Also, disclaimer: I am the author of the validation spec; the section you linked to is my writing and I wrote it precisely to allow for maximum interoperability. But a SHOULD is not a MUST, and most validators out there use their native regex engines which don't comply with ECMA 262.

As I already mentioned, it is purely for correctness purposes that I use Rhino at the moment; it is also because I could not find any faster ECMA 262 regex library. I know Java 8 will use a new JS engine (I cannot remember its name offhand at the moment) which is supposedly faster, but I don't know whether it would be available as a separate library to use with Java 7 and below.

fge commented

OK, closing this one. Won't be doable for 1.2.x. I intend to be able to do this in 2.0 though.

Rhino is huge; https://github.com/jruby/joni is also fast and compliant.