waltheri/wgo.js

The LL(*) technique can simplify the SGF parser

Closed this issue · 3 comments

I've briefly looked at the SGF parser of wgo and thought that it can be dramatically simplified by rewriting it in a form of a recursive descent LL(*) parser. This way it could be written in just 5 lines of code:

https://github.com/d180cf/tsumego.js/blob/master/src/sgf.ts#L27-L45

Yes, it looks simpler, but I see that your code is written in TypeScript and uses some kind of external library. Will it be simpler and shorter after compilation?

TS is merely a way to add type annotations. The SGF parser refers to a file named rdp.ts which implements a few parsing primitives.The idea is very simple:

  1. First we implement a few basic parsers: a function that can parse a fixed string, a function that can parse a string matching a basic regexp and so on.
  2. Then we implement the .then or .map method that constructs a new parsing function which transforms the result, i.e. if xy is a parsing function and xy.exec("ca") returns "ca" then xy.map(s => ...).exec("ca") returns [2, 0].
  3. Finally we implement basic parser combinators, i.e. if a function called tag parses strings like ;FF[4] and returns an array like ["FF", 4] then tag.rep() or rep(tag) constructs a function that can parse strings like ;FF[4];B[cc];W[ab] and return arrays like [["FF"], 4], ["B", "cc"], ["W", "ab"]].

As you see, this way the small library of parsing primitives lives separately and is rarely changed, while the SGF parser uses those primitives to construct a specific SGF parser. If you're familiar with ABNF, you'll notice that the way the SGF parser is written with these parsing functions is pretty much identical to how it would be written in ABNF.

The price for such convenience of writing parsers is performance. The LL parsers are generally not very fast because a typical parsing rule combined of other rules via alternation, i.e. expr = term "+" term | term "*" term | ..., has to reparse the input from the very beginning if the current alternative didn't matrch the input. LR parsers don't have this problem, but they are much harder to write and debug and they have certain not obvious restrictions in how rules can be written. However those who really care about performance write parsers manually, like you did: GCC had a GLR parser which was rewritten with a bunch of tricky for/while loops because that was the only way to achieve acceptable speed.

Long ago I even wrote a lib that implements this idea in a more general way: https://github.com/c5f7c9/llkp (that time c5f7c9 was a hash of something I don't remember :)

I studied grammars at school but probably forget most of it ;-). However I think my parser is very similar to yours. Parser itself starts here, other code just transforms SGF properties into a better form - for example B[aa] will transform into {x: 1, y: 1, c: WGo.B}. To be clear it works like this:

  1. SGF is divided into sequence of nodes - (;FF[4]SZ[19] (;W[aa];B[ab]) (;B[cd])) => ["(", ";FF[4]SZ[19]", "(", ";W[aa]", ";B[ab]", ")", "(", ";B[cd]", ")", ")"], brackets mean new variation or its end.
  2. Each node is transformed into array of properties - ;FF[4]SZ[19] => ["FF[4]", "SZ[19]"].
  3. Each property is transformed into array of values - AB[ce][dd][ff] => ["[ce]", "[dd]", "[ff]"], and an identifier is extracted - AB[ce][dd][ff] => AB.
  4. Then I take these data and make them pretty.

It seems to me, that my parser is longer and less elegant because it doesn't use any helper functions. However if you rewrite your parser into javascript, I would be willing to use it, if it will be shorter.