PHP PEG - A PEG compiler for parsing text in PHP
This is a Parsing Expression Grammar (PEG) compiler for PHP. PEG parsers are an alternative to parsers built from context-free grammars (CFGs); they combine tokenization and parsing in a single top-down grammar. For a basic overview of the subject, see http://en.wikipedia.org/wiki/Parsing_expression_grammar
Quick start
- Write a parser. A parser is a PHP class with a grammar contained within it in a special syntax. The file extension is .peg.inc. See the examples directory.
- Compile the parser: php ./cli.php ExampleParser.peg.inc > ExampleParser.php
- Use the parser (you can also include code to do this in the input parser - again see the examples directory):

    $x = new ExampleParser( 'string to parse' ) ;
    $res = $x->match_Expr() ;

Parser Format
Parsers are contained within a PHP file, in a special comment block that starts with /*Parser:NameOfParser and continues until the comment is closed. During compilation this block will be replaced with a set of matching functions.
Lexically, each rule in the parser consists of a name token, a matching rule and a set of functions. The name token must not start with whitespace, must contain no whitespace, and must end with a : character. The matching rule and function set go on the same line or on the indented lines below.
Rules
PEG matching rules try to follow standard PEG format, summarised as follows:

    token*            - Token is optionally repeated
    token+            - Token is repeated at least once
    token?            - Token is optionally present
    tokena tokenb     - Token tokenb follows tokena, both of which are present
    tokena | tokenb   - One of tokena or tokenb is present, preferring tokena
    &token            - Token is present next (but not consumed by parse)
    !token            - Token is not present next (but not consumed by parse)
    ( expression )    - Grouping for priority

But with these extensions:

    < or >   - Optionally match whitespace
    [ or ]   - Require some whitespace

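To make the operator semantics concrete, here is a minimal hand-written sketch of ordered choice, repetition and negative lookahead in plain PHP. It does not use or mirror the code this compiler generates; the helpers lit, choice, star and not_ahead are hypothetical names invented for this illustration.

```php
<?php
// Each matcher takes ($string, $pos) and returns the new position on
// success or null on failure. Hypothetical helpers, not part of PHP PEG.

// A literal token: matches the exact text $s at the current position.
function lit(string $s): callable {
    return function (string $str, int $pos) use ($s) {
        return substr($str, $pos, strlen($s)) === $s ? $pos + strlen($s) : null;
    };
}

// tokena | tokenb : ordered choice, the first alternative that matches wins.
function choice(callable ...$alts): callable {
    return function (string $str, int $pos) use ($alts) {
        foreach ($alts as $alt) {
            $next = $alt($str, $pos);
            if ($next !== null) {
                return $next;
            }
        }
        return null;
    };
}

// token* : greedy repetition, zero or more times; never fails.
function star(callable $m): callable {
    return function (string $str, int $pos) use ($m) {
        while (($next = $m($str, $pos)) !== null && $next > $pos) {
            $pos = $next;
        }
        return $pos;
    };
}

// !token : succeeds only if token does not match here; consumes nothing.
function not_ahead(callable $m): callable {
    return function (string $str, int $pos) use ($m) {
        return $m($str, $pos) === null ? $pos : null;
    };
}

// "a" | "ab" applied to "ab" consumes only "a": once the first
// alternative succeeds, the second is never tried.
$ordered = choice(lit('a'), lit('ab'));
$end = $ordered('ab', 0);

// "ab"* applied to "ababx" consumes both repetitions and stops before "x".
$run = star(lit('ab'));
$runEnd = $run('ababx', 0);

// !"x" fails where an "x" is next and succeeds (consuming nothing) elsewhere.
$guard = not_ahead(lit('x'));
```

The ordered-choice behaviour is the main practical difference from a CFG alternative: put the longer or more specific alternative first if both could match at the same position.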
Tokens
Tokens may be
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar
- literals, surrounded by " or ' quote pairs. No escaping support is provided in literals.
- regexes, surrounded by / pairs.
- expressions - single words (match \w+) starting with $, or more complex expressions surrounded by ${ }, which call a user-defined function to perform the match
Regular expression tokens
Regular expression tokens are automatically anchored to the current position in the string - do not include a string start anchor (^) anywhere.
Flags can be specified on stand-alone regexes. Flags on regexes with rules are not currently handled.
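As an illustration of why a ^ anchor gets in the way, the sketch below shows one way to anchor a pattern at a parse position with PHP's own preg_match(). This is an assumption about how such anchoring can be done, not a description of this compiler's internals; match_regex_at is a hypothetical helper.

```php
<?php
// Hypothetical helper, not part of PHP PEG: try to match the regex body
// $body at exactly $pos in $string. The "A" (PCRE_ANCHORED) modifier pins
// the match to the offset passed as the fifth argument of preg_match().
function match_regex_at(string $body, string $string, int $pos): ?string {
    $pattern = '/' . $body . '/A';
    if (preg_match($pattern, $string, $m, 0, $pos) === 1) {
        return $m[0];
    }
    return null;
}

$input = 'ab12cd';

// Digits start at position 2, so the anchored match succeeds there.
$atTwo = match_regex_at('\d+', $input, 2);

// No digit at position 0, and the anchored match does not scan ahead.
$atZero = match_regex_at('\d+', $input, 0);

// A "^" inside the pattern refers to the start of the whole subject, not
// to the offset, so this finds nothing when the parse position is 2.
$withCaret = preg_match('/^\d+/', $input, $m, 0, 2);
```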
Expressions
Expressions allow run-time calculated matching. You can embed an expression within a literal or regex token to match against a calculated value, or simply specify the expression as a token to (optionally) internally handle matching and generate a result.
Expressions try a variety of scopes to find a value. They look for variables already set in the current result, rule-attached functions and a variety of other functions and constants.
Scopes are tried in this order:
- against current result
- against containing expression stack in order (for sub-expressions only)
- against parser instance as variable
- against parser instance as rule-attached method INCLUDING $ (i.e. function $foo())
- against parser instance as method INCLUDING $
- as global method
- as constant
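The lookup order above can be pictured as a simple fallback chain. The following is only a sketch under stated assumptions: resolve_expression, its signature and the DemoParser class are hypothetical, values are stored as plain strings for brevity, and the use of DLR in method names is inferred from the PHP name mapping section rather than taken from the compiler's source.

```php
<?php
// Hypothetical sketch of the lookup order; not PHP PEG's real internals.
function resolve_expression(string $name, array $result, array $stack, object $parser, string $rule) {
    // 1. A value already stored in the current result.
    if (isset($result[$name])) {
        return $result[$name];
    }
    // 2. Containing expression results, in stack order (sub-expressions only).
    foreach ($stack as $outer) {
        if (isset($outer[$name])) {
            return $outer[$name];
        }
    }
    // 3. A variable (property) on the parser instance.
    if (isset($parser->$name)) {
        return $parser->$name;
    }
    // 4. A rule-attached method whose name includes the "$"
    //    (assumed here to be remapped to "DLR").
    $attached = $rule . '_DLR' . $name;
    if (method_exists($parser, $attached)) {
        return $parser->$attached();
    }
    // 5. A plain method on the parser instance, again including the "$".
    $method = 'DLR' . $name;
    if (method_exists($parser, $method)) {
        return $parser->$method();
    }
    // 6. A global function.
    if (function_exists($name)) {
        return $name();
    }
    // 7. A constant.
    if (defined($name)) {
        return constant($name);
    }
    return null;
}

// Hypothetical parser used only to exercise the chain.
class DemoParser {
    public $indent = '  ';
    function quoted_DLRq() { return '"'; }
}
define('DEMO_LIMIT', 10);

$p = new DemoParser();
$fromResult   = resolve_expression('q', ['q' => "'"], [], $p, 'quoted');
$fromMethod   = resolve_expression('q', [], [], $p, 'quoted');
$fromProperty = resolve_expression('indent', [], [], $p, 'quoted');
$fromConstant = resolve_expression('DEMO_LIMIT', [], [], $p, 'quoted');
```

Because the current result is checked first, a named match such as q shadows any method, function or constant of the same name.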
Tricks and traps
Be careful when matching against results

    quoted_good: q:/['"]/ string "$q"
    quoted_bad:  q:/['"]/ string $q

"$q" matches against the value of q again. $q simply returns the value of q, without doing any matching.
Named matching rules
Tokens and groups can be given names by prepending name and :, e.g.,

    rulea: "'" name:( tokena tokenb )* "'"

There must be no space between the name and the :

    badrule: "'" name : ( tokena tokenb )* "'"

Recursive matchers can be given a name the same as their rule name by prepending with just a :. These next two rules are equivalent

    rulea: tokena tokenb:tokenb
    rulea: tokena :tokenb

Rule-attached functions
Each rule can have a set of functions attached to it. These functions can be defined
- in-grammar by indenting the function body after the rule
- in-class after close of grammar comment by defining a regular method whose name is {$rulename}_{$functionname}, or {$rulename}{$functionname} if the function name starts with _
- in a sub class
All functions that are not in-grammar must have PHP compatible names (see PHP name mapping). In-grammar functions will have their names converted if needed.
All these definitions define the same rule-attached function

    class A extends Parser {
    /**Parser
    foo: bar baz
      function bar() {}
    */

      function foo_bar() {}
    }

    class B extends A {
      function foo_bar() {}
    }

PHP name mapping
Rules in the grammar map to PHP functions named match_{$rulename}. However, rule names can contain characters that PHP function names can't.
These characters are remapped:

    '-' => '_'
    '$' => 'DLR'
    '*' => 'STR'

Other disallowed characters are removed.
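A minimal sketch of this mapping follows. The helper rule_to_function_name is hypothetical, and the reading of "disallowed" as "anything outside letters, digits and underscore" is an assumption; the compiler's own implementation may differ in detail.

```php
<?php
// Hypothetical helper mirroring the mapping described above.
function rule_to_function_name(string $rule): string {
    // Remap the three special characters first.
    $mapped = strtr($rule, ['-' => '_', '$' => 'DLR', '*' => 'STR']);
    // Assumption: "disallowed" means anything outside [A-Za-z0-9_].
    $mapped = preg_replace('/[^A-Za-z0-9_]/', '', $mapped);
    return 'match_' . $mapped;
}

$a = rule_to_function_name('quoted-string'); // hyphen becomes underscore
$b = rule_to_function_name('$value');        // dollar becomes DLR
$c = rule_to_function_name('item*');         // star becomes STR
$d = rule_to_function_name('a.b');           // the dot is dropped
```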
Results
Results are a tree of nested arrays.
Without any specific control, each rule's result will just be the text it matched against in a ['text'] member. This member must always exist.
Marking a subexpression, literal, regex or recursive match with a name (see Named matching rules) will insert a member into the result array named that name. If there is only one match it will be a single result array. If there is more than one match it will be an array of arrays.
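The single-versus-multiple behaviour can be sketched as below. This is an illustration of the storage rule as described here, not the library's actual code; store_named is a hypothetical helper, and it relies on the ['text'] member always being present to tell a single result from a list of results.

```php
<?php
// Hypothetical sketch of the default storage behaviour for named matches.
function store_named(array &$res, string $name, array $sub): void {
    if (!isset($res[$name])) {
        // First match under this name: store the single result array.
        $res[$name] = $sub;
    } elseif (isset($res[$name]['text'])) {
        // Second match: promote the single result to an array of results.
        $res[$name] = [$res[$name], $sub];
    } else {
        // Third and later matches: append to the existing array of results.
        $res[$name][] = $sub;
    }
}

$res = ['text' => ''];

store_named($res, 'item', ['text' => 'a']);
$single = $res['item'];          // one match: a single result array

store_named($res, 'item', ['text' => 'b']);
store_named($res, 'item', ['text' => 'c']);
$count = count($res['item']);    // several matches: an array of result arrays
```

Code that consumes results therefore needs to handle both shapes for any name that can match more than once.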
You can override result storing by specifying a rule-attached function with the given name. It will be called with a reference to the current result array and the sub-match - in this case the default storage action will not occur.
If you specify a rule-attached function for a recursive match, you do not need to name that token at all - it will be called automatically. E.g.

    rulea: tokena tokenb
      function tokenb ( &$res, $sub ) { print 'Will be called, even though tokenb is not named or marked with a :' ; }

You can also specify a rule-attached function called *, which will be called with every recursive match made

    rulea: tokena tokenb
      function * ( &$res, $sub ) { print 'Will be called for both tokena and tokenb' ; }

Silent matches
By default all matches are added to the 'text' property of a result. By prepending a member with . that match will not be added to the ['text'] member. This doesn't affect the other result properties that named rules add.
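A rough sketch of the effect on a result array, assuming results are plain arrays with a ['text'] member as described under Results. The helper add_match and its parameters are hypothetical and only model the behaviour; they are not the compiler's generated code.

```php
<?php
// Hypothetical sketch: fold a sub-match into the current result.
function add_match(array &$res, array $sub, ?string $name = null, bool $silent = false): void {
    if (!$silent) {
        // Non-silent matches contribute to the accumulated text.
        $res['text'] .= $sub['text'];
    }
    if ($name !== null) {
        // Named storage happens whether or not the match is silent.
        $res[$name] = $sub;
    }
}

$res = ['text' => ''];
add_match($res, ['text' => '"'], 'q', true);     // silent, but still stored as q
add_match($res, ['text' => 'hello'], 'string');  // added to text and stored
add_match($res, ['text' => '"'], null, true);    // silent and unnamed
```

After these calls the result's text is just hello, while the quote character is still available through the q member.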
TODO
- Allow configuration of whitespace - specify what matches, and whether it should be injected into results as-is, collapsed, or not at all
- Allow inline-ing of rules into other rules for speed
- More optimisation
- Make Parser-parser be self-generated, instead of a bad hand-rolled parser like it is now.
- Slightly more powerful expressions: ${parent.q}, ${foo()->bar}, etc.
- Need to properly escape all literals. Expressions currently need to be in '', not ""
- PHP token parser, and other token streams, instead of strings only like now