goodmami/python-parsing-benchmarks

Add textX benchmark

ThatXliner opened this issue · 9 comments

https://github.com/textX/textX seems pretty promising

Seems like a good idea. textX appears to be just a different metasyntax on top of Arpeggio, though, so I wonder if we should use Arpeggio directly instead? Then again, textX has more stars on GitHub, so maybe it's more familiar.

@ThatXliner, care to provide an implementation? There are JSON examples for both textX and Arpeggio. They just need to be adapted to follow the JSON spec more closely (particularly regarding strings) and wrapped such that they can be timed.
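For what it's worth, the wrapper part would look something like this hypothetical sketch (the harness's exact interface may differ, and json.tx is a placeholder filename):

from textx import metamodel_from_file

# Hypothetical sketch; the benchmark harness's exact interface may differ.
# The idea is to build the parser once, outside the timed loop, and expose
# a per-document parse callable.
_mm = metamodel_from_file('json.tx')  # placeholder grammar filename

def parse(text):
    return _mm.model_from_str(text)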

I can try this Saturday! IMO, you should put the name as "textX (Arpeggio)".

> I can try this Saturday!

Great, thanks!

> IMO, you should put the name as "textX (Arpeggio)".

That's fine if you use textX. My point is that you can apparently use Arpeggio without textX, either with a PEG-like metasyntax or with Python code (like PyParsing, in a way). Defining the grammar in Python code would almost certainly make parser instantiation faster, but I don't think it would affect the actual parsing performance.
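For illustration, the Python-code style looks roughly like this (a minimal sketch with the arpeggio package, not a full JSON grammar):

from arpeggio import ParserPython, ZeroOrMore, EOF
from arpeggio import RegExMatch as _

# Minimal sketch of Arpeggio's Python-code style: each rule is a plain
# function returning a parsing expression; not a full JSON grammar.
def number(): return _(r'-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][+-]?[0-9]+)?')
def array(): return '[', ZeroOrMore(number, sep=','), ']'
def top(): return array, EOF

parser = ParserPython(top)  # built once
tree = parser.parse('[1, 2.5, -3e4]')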

When you say "top-level element", do you mean the first element? And if single values are allowed as a top-level element, do you mean that

1

should be counted as valid?

> When you say "top-level element", do you mean the first element?

Are you referring to the JSON task document? If so, then I mean that a JSON document consists of a single value, and that value may be an object ({...}), an array ([...]), a string, a number, true, false, or null. This is following the JSON specification.
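For example, each of these strings is a complete JSON document on its own:

>>> import json
>>> [json.loads(s) for s in ['{}', '[]', '"s"', '2.5', 'true', 'null']]
[{}, [], 's', 2.5, True, None]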

> And if single values are allowed as a top-level element, do you mean that
>
> 1
>
> should be counted as valid?

Yes. This is how Python's JSON parser works, as well:

>>> import json
>>> json.loads('1')
1

Looking at the textX example, that means the top-level rule, at least, needs to be expanded:

 File:
-    Array | Object
+    Value
 ;

Some more work will need to be done, such as redefining STRING and BOOL, and maybe FLOAT, so they fit the JSON spec more tightly. They are "primitive" patterns defined here:

https://github.com/textX/textX/blob/ac99d92da2d9a5c5d85cf3ffaacb1779b4a5a0c2/textx/lang.py#L91-L99

For instance, 01 is not a valid integer in JSON, but the INT rule in textX seems to allow it. You can check the examples in this repository for valid patterns. For example, in Lark:

STRING: "\"" INNER* "\""
INNER: /[ !#-\[\]-\U0010ffff]*/
     | /\\(?:["\/\\bfnrt]|u[0-9A-Fa-f]{4})/
NUMBER: INTEGER FRACTION? EXPONENT?
INTEGER: ["-"] ("0" | "1".."9" INT?)
FRACTION: "." INT
EXPONENT: ("e"|"E") ["+"|"-"] INT

Or in SLY:

@_(r'-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][+-]?[0-9]+)?')
def NUMBER(self, t):
    t.value = float(t.value)
    return t

@_(r'"([ !#-\[\]-\U0010ffff]+|\\(["\/\\bfnrt]|u[0-9A-Fa-f]{4}))*"')
def STRING(self, t):
    t.value = json_unescape(t.value)
    return t
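Putting that together, a tighter textX grammar might look something like the following untested sketch; the rule names are illustrative (chosen to avoid textX's built-in STRING/BOOL/FLOAT primitives), and the regexes mirror the Lark and SLY patterns above:

from textx import metamodel_from_str

# Untested sketch: custom String/Number match rules replace textX's
# built-in primitives, which are looser than the JSON spec.
grammar = r'''
File: Value;
Value: Object | Array | String | Number | Bool | Null;
Object: '{' members*=Member[','] '}';
Member: key=String ':' value=Value;
Array: '[' values*=Value[','] ']';
String: /"([ !#-\[\]-\U0010ffff]|\\(["\/\\bfnrt]|u[0-9A-Fa-f]{4}))*"/;
Number: /-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][+-]?[0-9]+)?/;
Bool: 'true' | 'false';
Null: 'null';
'''

mm = metamodel_from_str(grammar)
model = mm.model_from_str('{"a": [1, true, null]}')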

For the Lark parser, you don't need to specify all that whitespace explicitly; you can just %ignore it. Wait, there's a special whitespace requirement, right?

We tweaked the Lark one to try to make it faster (see 2ddd8ce and lark-parser/lark#487 (comment)). We're not completely sold that it was a good idea, because it makes the grammar harder to read for only a 5% speedup.
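For comparison, the %ignore approach looks like this minimal sketch (illustrative only; not the grammar used in this repo):

from lark import Lark

# Minimal sketch of %ignore: whitespace between tokens is skipped
# globally instead of being spelled out in every rule.
parser = Lark(r'''
    start: value
    value: "true" | "false" | "null"
    %import common.WS
    %ignore WS
''')

print(parser.parse('  true  ').pretty())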

The whitespace requirement for the JSON parser is just as the spec dictates. Whitespace is allowed and ignored before and after any value and object key. Unescaped newlines are not allowed inside strings.
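Python's json module shows both behaviors:

>>> import json
>>> json.loads(' { "a" : [ 1 , 2 ] } ')  # whitespace around values and keys is fine
{'a': [1, 2]}
>>> json.loads('"a\nb"')  # unescaped newline inside a string
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 3 (char 2)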

Now that #5 is merged and the times are included in the README, I think we can close this issue. Thanks again for the help!