Add textX benchmark
ThatXliner opened this issue · 9 comments
https://github.com/textX/textX seems pretty promising
Seems like a good idea. It seems like textX is just a different metasyntax on top of Arpeggio, so I wonder if we should just use Arpeggio instead? Then again, textX has more stars on GitHub so maybe it's more familiar.
@ThatXliner, care to provide an implementation? There are JSON examples for both textX and Arpeggio. They just need to be adapted to follow the JSON spec more closely (particularly regarding strings) and wrapped such that they can be timed.
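As a rough illustration of what "wrapped such that they can be timed" could look like, here is a minimal sketch using only the standard library. The `time_parser` helper and the `SAMPLE` document are hypothetical names made up for this example, not this repository's actual harness, and `json.loads` stands in for whatever parse function a textX or Arpeggio implementation would expose.

```python
import json
import timeit


def time_parser(parse, text, repeat=5, number=20):
    """Return the best observed time in seconds for one call of parse(text)."""
    timings = timeit.repeat(lambda: parse(text), repeat=repeat, number=number)
    # Each timing covers `number` calls, so divide to get a per-call figure.
    return min(timings) / number


# Hypothetical sample document; a real benchmark would load larger files.
SAMPLE = json.dumps(
    {"name": "textX", "tags": ["peg", "dsl"], "nums": [1, 2.5, -300.0], "ok": True}
)

best = time_parser(json.loads, SAMPLE)
print(f"json.loads: {best * 1e6:.1f} microseconds per parse")
```

Taking the minimum over several repeats is one common way to reduce noise from other processes; whether that matches how the other benchmarks here are measured is an assumption.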
I can try this Saturday! IMO, I think you should put the name like textX (Arpeggio).
I can try this Saturday!
Great, thanks!
I think you should put the name like textX (Arpeggio).
That's fine if you use textX. My point is that you can apparently use Arpeggio without textX, either with a PEG-like metasyntax or with Python code (like PyParsing, in a way). Writing Python code would most certainly make it fast to instantiate a parser, but I don't think it would affect the actual parsing performance.
When you say "top level element", do you mean like the first element? And if single values are allowed as a top-level element, do you mean 1 should be counted as valid?
When you say "top level element", do you mean like the first element?
Are you referring to the JSON task document? If so, then I mean that a JSON document consists of a single value, and that value may be an object ({...}), an array ([...]), a string, a number, true, false, or null. This follows the JSON specification.
And if single values are allowed as a top-level element, do you mean 1 should be counted as valid?
Yes. This is how Python's JSON parser works, as well:
>>> import json
>>> json.loads('1')
1
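The same check can be extended to every kind of top-level value. This is just a sketch against Python's `json` module as a reference; the `TOP_LEVEL_DOCS` list is a made-up set of inputs, one per value kind.

```python
import json

# One example of each kind of value JSON allows at the top level.
TOP_LEVEL_DOCS = ['{"a": 1}', '[1, 2]', '"text"', '1', 'true', 'false', 'null']

parsed = [json.loads(doc) for doc in TOP_LEVEL_DOCS]
for doc, value in zip(TOP_LEVEL_DOCS, parsed):
    print(f"{doc!r} -> {value!r}")
```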
Looking at the textX example, that means the top-level rule, at least, needs to be expanded:
File:
- Array | Object
+ Value
;
Some more work will need to be done, such as redefining STRING and BOOL and maybe FLOAT so they fit more tightly to the JSON spec. They are "primitive" patterns defined here:
https://github.com/textX/textX/blob/ac99d92da2d9a5c5d85cf3ffaacb1779b4a5a0c2/textx/lang.py#L91-L99
For instance, 01 is not a valid integer in JSON, but the INT rule in textX seems to allow it. You can check the examples in this repository for valid patterns. For instance, in Lark:
python-parsing-benchmarks/bench/lark/json.py
Lines 32 to 39 in 611cfc2
Or in SLY:
python-parsing-benchmarks/bench/sly/json.py
Lines 21 to 29 in 611cfc2
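For a library-independent picture of what "fitting the spec" means for these terminals, here is a sketch of number and string patterns written with Python's `re` module. These are my reading of the JSON grammar (RFC 8259), not the patterns from the Lark or SLY files referenced above, and the names `JSON_NUMBER`, `JSON_STRING`, `is_json_number`, and `is_json_string` are hypothetical.

```python
import re

# JSON number: optional minus, then either 0 or an integer with a nonzero
# first digit, then an optional fraction and an optional exponent.
# Leading zeros (01) and a leading plus sign (+1) are not allowed.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

# JSON string: double quotes around any mix of unescaped characters
# (anything except a quote, a backslash, or a control character) and
# the escape sequences the spec defines.
JSON_STRING = re.compile(r'"(?:[^"\\\x00-\x1f]|\\["\\/bfnrt]|\\u[0-9a-fA-F]{4})*"')


def is_json_number(text):
    """Return True if the whole text is a valid JSON number."""
    return JSON_NUMBER.fullmatch(text) is not None


def is_json_string(text):
    """Return True if the whole text is a valid JSON string literal."""
    return JSON_STRING.fullmatch(text) is not None


print(is_json_number("01"), is_json_number("-0.5e+10"))
print(is_json_string('"a\\nb"'), is_json_string('"a\nb"'))
```

The character classes use `[0-9]` rather than `\d` because `\d` also matches non-ASCII digits in Python, which JSON does not allow.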
For the Lark parser, you don't need to specify all those whitespaces. You can just %ignore it. Wait, there's like a special whitespace requirement, right?
We tweaked the Lark one to try and make it faster (see 2ddd8ce and lark-parser/lark#487 (comment)). We're not completely sold that it's a good idea, because it makes the grammar harder to read for only a 5% speedup.
The whitespace requirement for the JSON parser is just as the spec dictates. Whitespace is allowed and ignored before and after any value and object key. Unescaped newlines are not allowed inside strings.
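These two rules can be checked against Python's `json` module, which I'm using here only as a reference implementation; how a textX grammar would need to be written to behave the same way is not shown. The variable names are made up for the example.

```python
import json

# Whitespace around values and object keys is ignored.
spaced = json.loads(' \n{ "key" \t: [ 1 , 2 ] \r\n} ')

# A raw (unescaped) newline inside a string is rejected.
try:
    json.loads('"line1\nline2"')
    raw_newline_ok = True
except json.JSONDecodeError:
    raw_newline_ok = False

# The escaped form (backslash followed by n) is accepted and decodes
# to a newline character.
escaped = json.loads('"line1\\nline2"')

print(spaced, raw_newline_ok, repr(escaped))
```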