/canopy

Self-hosting PEG parser compiler for JavaScript

Primary LanguageJavaScriptOtherNOASSERTION

Canopy

Canopy is a parser compiler for JavaScript, based on Parsing Expression Grammars and heavily influenced by Treetop.

Download

Canopy and the parsers it generates require JS.Class -- we're building on top of this in order to support the composition and node typing features of Treetop, to which Ruby's object system lends itself very well.

You can grap a stable build of Canopy from bin/canopy-stable.js; this build is used to compile Canopy itself during development.

Usage

Parsers are generated from grammar definitions; Canopy uses a fairly standard notation for PEGs. You can create a parser by calling Canopy.compile() with a string containing the grammar for the parser. The parser's name will be the name of the grammar followed by Parser. For example:

Canopy.compile('grammar Integer\
  integer <- [1-9] [0-9]*')

IntegerParser.parse('3178')

// -> {textValue: '3178', offset: 0, elements: [
//      {textValue: '3', offset: 0},
//      {textValue: '178', offset: 1, elements: [
//        {textValue: '1', offset: 1},
//        {textValue: '7', offset: 2},
//        {textValue: '8', offset: 3}
//      ]}
//    ]}

Canopy.compile() both evaluates and returns the source code for the parser it has generated, so you can take that source and save it in a file for later use. The bin directory contains a Rhino script that generates parser files from grammar files, for example to compile Canopy's own grammar parser, I do this:

rhino bin/compile.js source/canopy/meta_grammar.peg

Grammar definitions

The grammar definition syntax of Canopy is very straighforward and uses common PEG idioms. In contrast to other parser compilers, it does not allow inline code but instead allows you to specify named types for syntax nodes; these types can refer to modules containing methods you want to attach to the nodes. A good example grammar is Canopy's metagrammar, which defines the grammar syntax using the syntax itself. See:

source/canopy/meta_grammar.peg

A PEG file should contain the name of the grammar, followed by a list of parsing rules. Each rule maps a name to a parsing expression, the allowed elements of which are listed below. The first rule is assumed to be the root of the parse tree, and can have any name you like.

Strings

A string match produces a single syntax node when the input exactly matches the string. Strings are enclosed in double quotes and may contain escape characters.

grammar StringTest
  root <- "Hello, world"

StringTestParser.parse("Hello, world")
// -> {textValue: "Hello, world", offset: 0}

Wildcards

The wildcard character . matches any single character.

grammar WildcardTest
  root <- .

WildcardTestParser.parse("x")
// -> {textValue: "x", offset: 0}

Character classes

A character class is a list of characters enclosed in brackets, just like you'd use in regular expressions. They can contain ranges like [a-z] and are case sensitive.

grammar CharClassTest
  root <- [0-9]

CharClassTestParser.parse("5")
// -> {textValue: "5", offset: 0}

Repetition

Any expression may be followed by one of the repetition characters, ?, * or +, meaning "zero or one", "zero or more" and "one or more" occurences of the preceeding expression. The ? operator passes through the match if one is found, or an empty-string node otherwise. The others return nodes containing lists of their matches.

grammar MaybeTest
  root <- "a"?

MaybeTestParser.parse("a")
// -> {textValue: "a", offset: 0}

MaybeTestParser.parse("")
// -> {textValue: "", offset: 0}

MaybeTestParser.parse("b")
// -> null


grammar ZeroOrMoreTest
  root <- "a"*

ZeroOrMoreTestParser.parse("")
// -> {textValue: "", offset: 0, elements: []}

ZeroOrMoreTestParser.parse("aa")
// -> {textValue: "aa", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: "a", offset: 1},
//    ]}


grammar OneOrMoreTest
  root <- "a"+

OneOrMoreTestParser.parse("")
// -> null

OneOrMoreTestParser.parse("aa")
// -> {textValue: "aa", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: "a", offset: 1},
//    ]}

Lookaheads

Positive and negative lookaheads are denoted by placing & or ! in front of an expression. A lookahead works by inspecting the input for a match and then proceeding without consuming input. For example, the word foo followed a space could be written "foo" &" ", whereas a non-space character could be written !" " .. If the lookahead fails then null is returned, otherwise an empty-string node is produced.

Note that, for example, when the expression "foo" &" " is matched against the string "foo ", the match succeeds but only the string "foo" is consumed. The space remains unconsumed in the input.

Sequences

A sequence is two or more expressions separated by spaces. The input matches a sequence only if it matches each expression in the sequence in turn.

grammar SequenceTest
  root <- "foo" "bar"

SequenceTestParser.parse("foobar")
// -> {textValue: "foobar", offset: 0, elements: [
//      {textValue: "foo", offset: 0},
//      {textValue: "bar", offset: 3}
//    ]}

SequenceTestParser.parse("foo")
// -> null

Items in sequences can be labelled to make the resulting parse tree easier to navigate. To label an expression, prefix it with a name and a colon (:). The elements array is still available, but some data will be copied to named fields on the sequence node:

grammar LabelTest
  root <- first:[a-z] (", " letter:[a-z])*

LabelTestParser.parse("a, b, c")
// -> {textValue: "a, b, c", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: ", b, c", offset: 1, elements: [
//        {textValue: ", b", offset: 1, elements: [
//          {textValue: ", ", offset: 1},
//          {textValue: "b", offset: 3}
//        ],
//          letter: {textValue: "b", offset: 3}
//        },
//        {textValue: ", c", offset: 4, elements: [
//          {textValue: ", ", offset: 4},
//          {textValue: "c", offset: 6}
//        ],
//          letter: {textValue: "c", offset: 6}
//        }
//      ]}
//    ],
//      first: {textValue: "a", offset: 0}
//    }

References

Referring to other rules is what gives grammars their expressive power, allowing them to express recursive structures. To refer to another rule, you just enter its name as part of an expression. When used as part of a sequence, a reference creates labelled nodes in the output.

grammar ReferenceTest
  first   <- second "done."
  second  <- "start."

ReferenceTestParser.parse("start.done.")
// -> {textValue: "start.done.", offset: 0, elements: [
//      {textValue: "start." offset: 0},
//      {textValue: "done.", offset: 6}
//    ],
//      second: {textValue: "start." offset: 0}
//    }

Choices

Choices are lists of expressions separated by " / " and denote a prioritized choice between several alternatives. The input is matched against each alternative in turn, and the first match to succeed is kept.

grammar ChoiceTest
  root <- "foo" / "bar"

ChoiceTestParser.parse("foo")
// -> {textValue: "foo", offset: 0}

ChoiceTestParser.parse("bar")
// -> {textValue: "bar", offset: 0}

ChoiceTestParser.parse("abc")
// -> null

Note that once a choice has succeeded, no further inputs are tried, so choices are unambiguous. This is true even if the choice causes later matching to fail, so for example this grammar will never parse the string "foobar" since we always pick the first branch of the choice.

grammar ChoiceTest
  root <- ("f" / "foo") "bar"

Choices bind more loosely than sequences, for example the expression

"foo" "bar" / "abc" "123"

should be interpreted as

("foo" "bar") / ("abc" "123")

Type annotations

Expressions can be tagged with types, allowing you to use mixins to add methods to syntax nodes. Type labels bind to sequences; if you want to tag a choice then wrap the choice in parentheses.

grammar TypeTest
  root <- first:"foo" second:"bar" <Extension>

Extension = new JS.Module({
  convert: function() {
    return this.first.textValue +
           this.second.textValue.toUpperCase();
  }
});

TypeTestParser.parse("foobar").convert()
// -> "fooBAR"


grammar ChoiceTypeTest
  root    <- (alpha / beta) <Extension>
  alpha   <- first:"a" second:"z"
  beta    <- first:"j" second:"c"

ChoiceTypeTestParser.parse("az").convert()
// -> "aZ"

ChoiceTypeTestParser.parse("jc").convert()
// -> "jC"