A simple BNF parser generator for Python.
It's recommended that you subclass the bnfparsing.ParserBase
class
whenever you create a new parser. This makes clear the purpose of the
parser.
Create a new parser using string-based grammar.
IF_STMT = """
if_stmt := "if " test
test := name cmp name
cmp := "==" | "!=" | ">" | "<"
name := alpha name | alpha
"""
class IfStmtParser(bnfparsing.ParserBase):
def __init__(self):
super().__init__(ws_handler=bnfparsing.ignore)
self.grammar(IF_STMT)
p = IfStmtParser()
p.parse('if x==y')
Note that you would have to define the rule alpha
yourself if this
was the entirety of the grammar - see below for other options!
Subclass the ParserBase class to create a new parser. You can create rules with a simple BNF (Backus-Naur Form) syntax. The first argument is the name of the rule, the second is the body of the rule.
Rules can call other rules or capture literals, in double or single
quotes. Recursion is permitted. Use the ignore_whitespace
parameter
to instruct the parser to skip over any whitespace between tokens.
The grammar-based approach above is essentially parsing the grammar and then doing this under the hood.
class IfStmtParser(bnfparsing.ParserBase):
def __init__(self):
super().__init__()
# add rules!
self.new_rule('if_stmt', ' "if " test ')
self.new_rule('test', 'name cmp name')
self.new_rule('cmp', ' "==" | "!=" | ">" | "<" ')
self.new_rule('name', 'alpha name | alpha')
p = IfStmtParser()
p.parse('if x==y')
You can also add customised rules at class creation time or dynamically later on.
Customised rules must accept an input string as an argument. If
successful, the custom rule must return a tuple containing the token it
creates and any unconsumed characters from the input string. If it fails,
it must return None
and the original input string.
class CaseParser(bnfparsing.ParserBase):
# create rules as part of the class definition
# using the rule decorator
@bnfparsing.rule
def upper(self, string):
""" Captures any upper-case letter. """
char = string[0]
if char.isupper():
return Token('upper', char), string[1:]
else:
return None, string
# or create them later on...
def lowercase(string):
""" Captures any lower-case letter. """
char = string[0]
if char.islower():
return Token('lower', char), string[1:]
else:
return None, string
p = CaseParser()
# ... as long as you add them as follows
p.rule_from_function('lower', lowercase)
p.parse('a')
p.parse('A')
This can be useful when you don't want 26 options in a row,
e.g. "A" | "B" | "C"
.
Also see bnfparsing.common
. This module contains some useful functions
that can be dropped in as rules. Most parsers will need one or two of
the common functions, which include:
alpha
,lower
andupper
digit
- Sequence versions of the above, e.g.
alpha_run
. These capture sequences of alpha characters. whitespace
When creating a parser, use the ws_handler
option to specify a means by
which the parse should handle whitespace between tokens. A whitespace
handler is a function that is called on the input string before each
literal token is parsed.
bnfparsing.whitespace
defines three handlers for use.
ignore
removes all whitespace between tokens and requires none.ignore_specific
removes all of the given types of whitespace prior to parsing, much like the in-builtstr.lstrip
.require
raises aDelimiterError
if a specific whitespace phrase is not found between tokens. Use theignore
option to optionally remove other whitespace that is surplus to that required
ignore
can be passed directly as a whitespace handler, whereas the other
two are functions that generate handlers - accordingly they must be
passed with arguments, e.g. require(' ')
.
You are free to define your own handler - it must accept an input string
and return a string. You can also specify whether custom rules should
use whitespace handling with the rule_with_option
decorator.
As seen, you can run the parser on an input string using the parse
method. This raises an error if the given string does not fit the rule
set or if there are any tokens remaining - unless you call parse
with
the optional allow_partial
argument.
Otherwise, the parser will consume the string and return an instance of
bnfparsing.Token
. This the top-most node of the syntax tree; any child
nodes represent the components of each node.
Use the value
method to generate the content of each node. For the
nodes at the base of the tree this will return the value in the node. For
all others, value
recursively combines the values of the tokens
beneath it.
# using the example above...
root = p.parse('if x==y')
assert(root.value() == 'if x==y')
for t in root.iter_under():
print(t.token_type, ':', t.value())
Leading to...
>>> 'literal : if'
>>> 'test: x == y'
Tokens also come equipped with a range of methods for searching or iterating over the tokens below them. These include:
child
: returns the nth childseries
: returns a lowest level tokens in a list.level
: recursively returns a list of the nth-deepest tokensfind
: returns all tokens of a given token typeflatten
: returns a new token with the same value, but collapsing the repeatedly recursive tokens generated by recursive rules.
See documentation for more information. Some of these methods come with
an as_str
option, returning lists of strings instead of lists of tokens.
- Expanded set of common functions?
- EBNF syntax, without relying on recursion?S