Find a way to analyze regex besides Rhino
mattbishop opened this issue · 6 comments
I understand there are differences between the regex in JS and in Java, but it would be good to figure out if they are meaningful in light of json-schema. Loading up a Rhino context to use simple patterns is, well, heavy.
There are quite a lot of differences between the regex dialects themselves. Plus, only one context is loaded for the whole duration of the program, really: the one which holds the necessary functions.
What is more, some commonly used regex idioms do not operate the same at all: witness `\w`, which only recognizes ASCII letters in Java, but all Unicode letters in JavaScript.
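The Java side of this is easy to check in isolation. The sketch below uses only java.util.regex from the JDK; the class and method names are made up for illustration. It shows that `\w` is ASCII-only by default in Java, and that the `Pattern.UNICODE_CHARACTER_CLASS` flag (Java 7 and later) widens it to Unicode word characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the validator.
public class WordClassDemo {
    // Default java.util.regex behaviour: \w is [a-zA-Z_0-9] only.
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");

    // With UNICODE_CHARACTER_CLASS, \w follows the Unicode notion of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesAscii(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "café": the last letter is outside ASCII
        System.out.println("default \\w:  " + matchesAscii(word));
        System.out.println("unicode \\w:  " + matchesUnicode(word));
    }
}
```

So the same pattern text can accept or reject the same input depending on how the engine defines its shorthand classes, which is the kind of mismatch that matters for `pattern`.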
Prior to using Rhino, the code was using JDK's builtin `ScriptEngine`, which is both heavier and slower. I didn't really have a choice here in order to be compliant ;)
I'm looking at using the json-schema validator in a highly-threaded, performance-sensitive environment. I may need to work on this bit because pattern validation is important for us.
Here is what I found to compare ECMA and Java:
http://www.regular-expressions.info/refflavors.html
I also want to look carefully at exactly how patterns are defined in the json-schema spec. In my experience with XML Schema, a document spec usually narrows the requirements of patterns to facilitate cross-platform implementations:
http://json-schema.org/latest/json-schema-validation.html#3.3
OK, to be fully honest, I did run performance tests on the current code but did not compare it against an implementation using java.util.regex; I do suspect the latter would be faster. But correctness is my first goal ;)
Keywords using regexes in JSON Schema are `pattern` and `patternProperties`. There is a way to completely bypass the use of Rhino if you use `pattern`: you have to write a customized syntax checker and keyword validator using java.util.regex and register them using a custom `ValidationConfiguration`. For `patternProperties`, it is unfortunately not possible unless you preload your own version of `ObjectSchemaSelector`.
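For what it's worth, the regex part of such a replacement could look roughly like the sketch below. It only covers the two java.util.regex steps (checking that the pattern compiles, then matching an instance string) and is not wired into the library: the class and method names are hypothetical, and registration through `ValidationConfiguration` is left out entirely. It assumes Java dialect semantics are acceptable for your schemas, and it uses `find()` because JSON Schema regexes are not implicitly anchored. Compiled `Pattern` objects are immutable and safe to share between threads, so they are cached.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of the validator's API.
public class JavaRegexPatternCheck {
    // Cache of compiled patterns, shared across threads.
    private static final ConcurrentMap<String, Pattern> CACHE =
            new ConcurrentHashMap<>();

    // Syntax-check step: is the schema's "pattern" value a valid Java regex?
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Validation step: does the instance string contain a match?
    // Assumes the regex already passed isValidRegex().
    static boolean matches(String regex, String input) {
        Pattern p = CACHE.get(regex);
        if (p == null) {
            Pattern compiled = Pattern.compile(regex);
            Pattern previous = CACHE.putIfAbsent(regex, compiled);
            p = (previous != null) ? previous : compiled;
        }
        // find(), not matches(): the pattern may match anywhere in the input.
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[a-z"));        // unclosed character class
        System.out.println(matches("b+", "abbbc"));      // unanchored match
        System.out.println(matches("^b+$", "abbbc"));    // explicitly anchored
    }
}
```

The trade-off is the one discussed above: this is light and fast, but patterns are interpreted with Java semantics rather than ECMA 262 ones.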
Also, disclaimer: I am the author of the validation spec; the section you linked to is my writing and I wrote it precisely to allow for maximum interoperability. But a SHOULD is not a MUST, and most validators out there use their native regex engines which don't comply with ECMA 262.
As I already mentioned, it is purely for correctness purposes that I use Rhino at the moment; it is also because I could not find any faster ECMA 262 regex library. I know Java 8 will use a new JS engine (I cannot remember its name offhand) which is supposedly faster, but I don't know whether it will be available as a separate library to use with Java 7 and below.
OK, closing this one. Won't be doable for 1.2.x. I intend to be able to do this in 2.0 though.
Rhino is huge; https://github.com/jruby/joni is also fast and compliant.