gllc

Write your generalized parser combinator in 60 lines and extend it.

A generalized-LL(1) parser combinator core can be written within 60 lines.

Based on this core and the additional helper object Parser, a simple parser for the example non-LL(1) grammar S → a a | a S a can be defined as

import gllc as gl

n   = gl.Parser()

n.a = gl.Token('a')

n.s = n.a * n.a |\
      n.a * n.s * n.a

>>> list(n.s('aaaaaa'))
[(('a', 'a'), 'aaaa'),
 ((('a', ('a', 'a')), 'a'), 'aa'),
 ((('a', (('a', ('a', 'a')), 'a')), 'a'), '')]

A few points here:

  • The recursive access to the attribute n.s is supported by a lazy strategy.
  • Combinator operations like * and | are binary and left-associative.
  • The parser namespace is managed neatly by the object n. You can even combine parsers from different namespaces.
  • A generalized parser normally delivers all partial parses, i.e. ambiguous results.

This README mainly talks about the intrinsic relationship between parsing and logical inference, and how simple the implementation becomes with coroutines. It also shows an interesting trick to simulate laziness so that the parser supports recursive symbols.

A tiny combinator core

Parser combinators are handy tools for ad-hoc parsing work. The most famous library is Haskell's Parsec. However, Parsec disables the generalized mode by default (see the semantics of the operator <|> and the function try in its documentation), which means the user needs to do extensive left-factoring to get a grammar working. This can lead to a loss of the intuition behind the grammar design (sometimes we just want to play with a grammar while tolerating some level of ambiguity).

However, it is not hard to build a generalized parsing core from scratch. Here we use Python for a concise implementation, benefitting from a few interesting language features:

  • generators
  • operator overloading
  • lazy access to object attributes

and that's all.

Analogy with logical expressions

Parsing is not much different from proving logical expressions built with AND/OR. For example, with the BNF-style grammar

S → A B C
  | D

it means we can prove S either by proving A first, then B and finally C, or by going another way and just proving D. Success along either way means S is proved. Analogously, we can read the grammar as definite clauses like the following

S <= A & B & C
S <= D

where &, | and <= mean logical and, or and implied-by. Symbols like A, B, etc. are seen as goals of logical inference. By the way, any symbol such as A or B can be terminal/atomic like a logical literal, which can be tested directly for its truth.
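
In the combinator notation developed below, these clauses translate almost verbatim. A sketch, with A, B, C and D standing for concrete sub-parsers such as Tokens:

# '*' plays the role of '&' (AND) and '|' plays the role of logical or.
A, B, C, D = Token('a'), Token('b'), Token('c'), Token('d')
S = A * B * C | D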

Input as resource for reasoning

We see an input string as a resource. Where parsing goes beyond plain logical reasoning is that proving a goal claims (i.e. consumes) a prefix of the resource (the recognized part) and thus delivers a split. Subsequent provings can only use the residual resource left after the split.

Note there may be more than one possible way of claiming a prefix. A generalized parser should yield all possible splittings. A split is then exactly a parse result: a pair in which the recognized part may have been transformed (consumed) into a type other than string:

Result : (<recognized-part>, <residual-part>)

and each parser, as a function, accepts an input string and yields a sequence of results, i.e.

Parser : <input> -> Seq Result

and an empty sequence if there is no successful splitting at all.
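
In Python terms, this contract can be sketched with type aliases (purely illustrative names, not used by the code below):

from typing import Any, Callable, Iterator, Tuple

# A result pairs the recognized (possibly transformed) part with the residue;
# a parser maps an input string to a (possibly empty) stream of such results.
Result = Tuple[Any, str]
ParserFn = Callable[[str], Iterator[Result]]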

Implementation

Based on the brief theory above, we can start thinking about the implementation. The very first object definition is the most fundamental one, PExpr, i.e. the parser expression. It can be defined as an abstract class supporting the AND and OR operations explained in the previous section (simulated by the __mul__ and __or__ operators). The __truediv__ and __pow__ operators are used for semantic actions (the reasons for these choices are explained later). The constructor accepts sub-expression arguments, which are the operands of the expression.

Each subtype of PExpr supports the __call__ method, which accepts an input string and yields parsing results.

class PExpr(object):
    # Abstract parser expression; subclasses implement __call__.
    def __init__(self, *subs):
        self.subs = subs
    def __mul__(self, other):
        # AND: sequence two parsers.
        return And(self, other)
    def __or__(self, other):
        # OR: try both alternatives.
        return Or(self, other)
    def __truediv__(self, other):
        # Apply a semantic function to each result.
        return Seman(self, other)
    def __pow__(self, other):
        # Cons: prepend a single result onto a following list of results.
        return Seman(And(self, other), lambda tp: [tp[0]] + tp[1])
    def __call__(self, inp):
        raise NotImplementedError

The Token parser is our first concrete parser. It tries to match a prefix of the input and yields a split, or yields nothing when matching fails.

class Token(PExpr):
    def __call__(self, inp):
        lit, = self.subs
        if inp.startswith(lit):
            yield lit, inp[len(lit):]
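
A quick check of Token:

>>> list(Token('ab')('abc'))
[('ab', 'c')]
>>> list(Token('ab')('xyz'))
[]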

The And expression extends PExpr with a left and a right operand, namely psr1 and psr2, and defines a new parser. It calls the parsing methods of its operands recursively, where the results of psr2 are computed based on the results of psr1. One detail here is that the Python * operator is left-associative (like most operators), so the left sub-expression of an And may itself be an And. Thus, within chained And expressions, the resulting tuple (r1, r2) has r1 as a nested tuple.

The reason for choosing __mul__ to represent the And operation is that tuples are treated as product types in the relevant theory (cf. algebraic data types).

class And(PExpr):
    def __call__(self, inp):
        psr1, psr2 = self.subs
        for r1, inp1 in psr1(inp):
            for r2, inp2 in psr2(inp1):
                yield (r1, r2), inp2
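
For example, chaining two tokens with * nests the recognized parts into a tuple:

>>> ab = Token('a') * Token('b')
>>> list(ab('abc'))
[(('a', 'b'), 'c')]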

For the Or expression, parsing yields results from both operands, and the results from either branch do not depend on the other branch. You could even think about submitting the parsing task of each operand to a parallel execution context.

class Or(PExpr):
    def __call__(self, inp):
        for psr in self.subs:
            yield from psr(inp)
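
Since every branch is tried, overlapping alternatives naturally yield multiple (ambiguous) results:

>>> p = Token('a') | Token('ab')
>>> list(p('abc'))
[('a', 'bc'), ('ab', 'c')]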

Then we need the famous Kleene-closure combinator, which can be implemented with a loop. The parsing starts with an agenda for tracking feasible results. In each iteration it takes the feasible splits on the agenda and tries to make new splits by consuming another piece of input. Once no new splits are found, it yields all results remaining on the agenda. This is an eager matching scheme: Many means "consume as much of the resource as possible while accumulating valid parsing splits".

class Many(PExpr):
    def __call__(self, inp):
        psr, = self.subs
        # Agenda of feasible splits: (results-so-far, residual-input).
        agd = [([], inp)]
        while 1:
            agd1 = []
            for rs, inp in agd:
                for r1, inp1 in psr(inp):
                    agd1.append((rs+[r1], inp1))
            # Keep extending while progress is made; otherwise report the agenda.
            if agd1: agd = agd1
            else: break
        yield from agd
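
A quick check of the eager behavior: only the maximal consumption is reported.

>>> list(Many(Token('a'))('aab'))
[(['a', 'a'], 'b')]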

For convenience, we can define Many1, which requires repetitive parsing of the sub-expression to succeed at least once.

class Many1(PExpr):
    def __call__(self, inp):
        psr, = self.subs
        m = Many(psr)
        for r, inp1 in psr(inp):
            for rs, inp2 in m(inp1):
                yield [r] + rs, inp2
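
The difference from Many shows up on inputs with no match at all:

>>> list(Many1(Token('a'))('b'))
[]
>>> list(Many(Token('a'))('b'))
[([], 'b')]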

Now we have enough combinators to handle grammars in most cases. You can already play with the parsers:

>>> ab = Many(Token('a') | Token('b'))
>>> next(ab('babbac'))
(['b', 'a', 'b', 'b', 'a'], 'c')

About the semantics

Parsing alone is often not enough: interpretation is almost always necessary when working on practical data transformation. We can now easily define the Seman expression to attach an interpretation:

class Seman(PExpr):
    def __call__(self, inp):
        psr, func = self.subs
        for r, inp1 in psr(inp):
            yield func(r), inp1

It is a binary operation with two operands: a parser psr and a function func. All it does is apply func to the results delivered by psr. It is also interesting that Seman objects can be nested. For example, Seman(Seman(p1, f1), f2) just pipes the results of p1 through f1 and then through f2.

Since we have already overloaded the __truediv__ operator for interpretation (/ is chosen since a product-type tuple may be divided into its parts and handled further), we can use it like:

>>> r = Token('a') / str.upper
>>> next(r('ac'))
('A', 'c')

In case of handling results from Many, the semantic function accepts a list as its argument:

>>> p = Token('0') | Token('1')
>>> r = Many(p / int) / (lambda xs: [x for x in xs if x > 0])
>>> next(r('101012'))
([1, 1, 1], '2')

How to be lazy?

One big problem with using combinators in practice is that most languages are strict when constructing data structures, that is, every argument of an operation must be evaluated before the operation itself.

For example, given the grammar:

a → 'a'
S → a a
  | a S a

we cannot write the combinator in Python like

a = Token('a')
S = a * a |\
    a * S * a

since S needs to be evaluated in the right-hand-side sub-expression a * S * a before it is actually defined. In a lazy language like Haskell this is not a problem, since no evaluation happens when constructing an expression.

But we don't want to lose our concise way of representing the grammar. One approach is to use a lambda for lazy look-up of S in the global environment:

a = Token('a')
S = lambda inp: \
    (a * a | a * S * a)(inp)

which works, but it is certainly annoying to repeatedly type lambda inp: when playing with your own little language.

Here we have several possible tricks.

A lazy object

First we define the LazyExpr class. A lazy expression is defined relative to a given context (a.k.a. namespace or environment), which is prepared in advance.

class LazyExpr(PExpr):
    def __init__(self, name, context):
        self.name = name
        self.context = context
    def __call__(self, *args):
        return self.context[self.name](*args)

Note that such an expression is logically also a subtype of PExpr and is callable. Now the grammar above can be written as

# Prepare the context and the lazy placeholder
ctx = {}
S = LazyExpr('S', ctx)

# Making rules
a = Token('a')
S = ctx['S'] = a * a |\
               a * S * a

Now calling the combinator seems neater:

>>> list(S('aa'))
[(('a', 'a'), '')]

>>> list(S('aaaaaa'))
[(('a', 'a'), 'aaaa'),
 ((('a', ('a', 'a')), 'a'), 'aa'),
 ((('a', (('a', ('a', 'a')), 'a')), 'a'), '')]

And even simpler?

It still seems annoying to always write the ctx['S'] stuff. To get rid of this, we finally come to the last resort: overriding the built-in attribute access. The class Parser is supposed to be a manager of the context of the lazy expressions, subtyping dict:

class Parser(dict):

    def __getattr__(self, k):
        # Any attribute access yields a lazy expression looked up in this dict.
        return LazyExpr(k, self)

    def __setattr__(self, k, v):
        # Bind a symbol, or extend an existing binding with an alternative.
        if k not in self:
            self[k] = v
        else:
            self[k] |= v

In short, with the overridden __getattr__, accessing a symbol k as an attribute creates a lazy expression for k. With __setattr__, a symbol k is bound to a right-hand-side expression (or the existing binding is augmented via an Or expression).

Now the S-grammar above can be written as

n = Parser()

a = Token('a')
n.s = a * a | a * n.s * a

which, though not perfectly ideal, is concise enough for practical use. More interestingly, Parser provides an extra policy for managing combinators within namespaces, which can be useful if you define different grammars in separate modules and combine them for use. Say you have a parser n and a parser m; you can now define a parser like my_parser = n.s * m.t * n.w.

Utilities

To make ad-hoc things easier, we prepare two functions fst and snd to select the useful component of a recognized result yielded by an And parser (which is always a 2-tuple).

from operator import itemgetter
fst = itemgetter(0)
snd = itemgetter(1)

Utility combinators like the following can be useful too:

class OneOf(PExpr):
    def __init__(self, alts):
        self.alts = set(alts)
    def __call__(self, inp):
        if inp and inp[0] in self.alts:
            yield inp[0], inp[1:]

class NoneOf(PExpr):
    def __init__(self, alts):
        self.alts = set(alts)
    def __call__(self, inp):
        if inp and inp[0] not in self.alts:
            yield inp[0], inp[1:]


White = Many(OneOf(' \t\n'))
Digit = OneOf('0123456789')
Alpha = OneOf(map(chr, [*range(65, 91), *range(97, 123)]))
AlphaNum = Alpha | Digit
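
A quick check of these utilities:

>>> next(Digit('42'))
('4', '2')
>>> next(AlphaNum('x1'))
('x', '1')

The remaining examples also use a Word combinator that is not defined in the core above. Presumably it matches a literal like Token and then skips trailing whitespace; a minimal sketch under that assumption:

def Word(lit):
    # Hypothetical helper assumed by the examples below: match the literal,
    # then silently drop any trailing whitespace.
    return Seman(And(Token(lit), White), fst)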

A quite useful operator, ** (see its overload in the PExpr class), helps with prepending a single element onto a list delivered by a following Many or Many1. For example, a list of integers separated by commas can be handled like this (relying on the Word helper sketched above):

Digits = Many(Digit) * White / fst / ''.join
int1 = Digits / int
ints = int1 ** Many(Word(',') * int1 / snd)

>>> next(ints('123, 45, 987'))
([123, 45, 987], '')

Even more operator sugar can be provided by overloading __rshift__ and __lshift__ in PExpr based on fst and snd, keeping only the right or only the left part respectively:

class PExpr(object):
    ...
    def __lshift__(self, other):
        return Seman(And(self, other), fst)
    def __rshift__(self, other):
        return Seman(And(self, other), snd)

Digits = (Many(Digit) << White) / ''.join
int1 = Digits / int
ints = int1 ** Many(Word(',') >> int1)

>>> next(ints('123, 45, 987'))
([123, 45, 987], '')

Altogether now

Finally, here is an example grammar for basic arithmetic.

e = Parser()

def prod(xs):
    p = 1
    for x in xs: p *= x
    return p

e.expr   = e.term ** Many(Word('+') >> e.term) / sum
e.term   = e.factor ** Many(Word('*') >> e.factor) / prod
e.factor = e.number |\
           Word('(') >> e.expr << Word(')')
e.number = (Many1(Digit) << White) / ''.join / int
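
As a quick check (relying on the Word helper sketched earlier), the parser should evaluate simple expressions directly:

>>> next(e.expr('1 + 2 * 3'))
(7, '')
>>> next(e.expr('(1 + 2) * 3'))
(9, '')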

And that's all. The user can extend the parsers with whatever new combinators the task requires.

TODO

There are potential improvements regarding performance and memory usage. For example:

  • Use indices rather than making string slices to represent residual inputs.
  • Use Graph Structured Stack or CONS to avoid copying lists when parsing with Many, Many1.
  • Augment the Result structure with Left/Right structure to report error messages.
  • Make process tracing possible.
  • Make predictive/non-consuming combinators (starting from << and >>).
  • Detect left recursion.