/squarepeg

A peg parser in Clojure inspired by OMeta

Primary LanguageClojure

#squarepeg

Credits and Acknowledgements

This software was written by Eric Normand and is released under the Eclipse Public License. You can find it on github.

Special thanks also goes to Christophe Grand for his help.

Quickstart

To use in your project.

Add [squarepeg "0.6.1"] to your project's dependencies then run.

lein deps

In the relevant code, add the following to you ns declaration:

(:use squarepeg.core)

To hack the code

git clone git@github.com:ericnormand/squarepeg.git
cd squarepeg
lein deps
lein test

Introduction

squarepeg is a library for defining PEGs. PEG stands for Parsing Expression Grammar.

The library defines a set of parser combinators for creating grammar rules.

Parsers created with squarepeg are data-structure agnostic. They can of course input Java Strings. But they can also input any seq. This means you can apply a grammar to a sequence of Integers, for example.

You may want to jump right in to the examples in src/squarepeg/examples.clj

There is also a small example of how to use squarepeg to compile a class file for use with Java.

NOTE: squarepeg used to be called clj-peg! But then I realized the name was taken. Now I have a name without a hyphen.

##How it works

squarepeg is defined in terms of combinators. Each combinator is a function which generates an atomic unit of a parser (called a rule). By combining these parts (with combinators! ;-), you can generate complex parsers that can handle a superset of Context Free Grammars.

Note that the combinators are not monadic! I tried monadic combinators but they didn't seem like a good fit for Clojure, which does not have built-in support for Monads, like Haskell does. If you want to find a monadic combinator library, there are plenty around.

First, we'll go over some basic concepts.

###Rules

A rule is a function of four arguments, input, bindings, context and memo.

input is a seq of inputs (a text string, a vector, a lazy list), and bindings is a map of current bindings.

Context is a map of bindings that are not mutable. Think of it as a global scope for the duration of the parse.

Memo is a map used for memoization.

A rule should return (success . . .) or (fail msg) with a failure message.

###Success

Success is defined as a map of five keys, :i, :b, :r, :s, and :m.

:i is the rest of the input after the rule has consumed all it wants.

:b is the bindings available to the rest of the rules. Typically, a rule will assoc new bindings onto the bindings parameter.

NOTE: Returning {} or nil will clear all bindings from subsequent rule calls.

:r is the return value of the rule.

:s is the sequence-context return value, which means it's always a vec.

:m is the memoization map. Parser rules that wish to memoize should assoc their results to this map. Please see mkmemo for an example.

The function success? determines if a rule was successful.

A little more detail on return values

Most rules return a single value, let's call it a. This means you set :r to a and :s to [a].

If you want to return a sequence of values [a b c], set :r to [a b c] and :s to [a b c].

If you want to return no value, you should set :r to nil and :s to [].

Finally, there are times when you want to return a seq of values, but have the parser consider it to be a single value, you would set :r to [a b c] and :s to [[a b c]].

:s is primarily used by the mkseq combinator to collect return values of its subrules.

The rule is simple: if the :r element is equal to the :s element, we are dealing with a seq of values, not a single value.

Failure

Failure is a map with two values, :fail and :m, which are mapped to the failure message and the memo object, respectively.

Use failure? to determine if a rule failed.

###Combinators

Combinators are functions that generate rules.

####Built-in combinators

Combinators are defined in src/squarepeg/core.clj

mknot inverts a rule. Given a rule, mknot returns a new rule that fails when the given rule succeeds, and vice versa. It never returns a value and never consumes input. It is most useful for making a rule that says "and is not followed by . . .".

Example:

(def not-followed-by-whitespace (mknot whitespace))
(not-followed-by-whitespace "abc" {} {} {}) 
    => {:r nil :s [] :i (\a \b \c) :b {} :m {})
(not-followed-by-whitespace " abc" {}) => {:fail "NOT failed"}

mkbind creates a rule that will bind the return value of its first argument (a rule) to its second argument (typically a keyword). Binding is useful if you want to refer to a rule later.

Example:

;; bind the matched digits to :digits
(def digits (mkbind 
    (mk1om (mkpr #(Character/isDigit %))) :digits))
(digits "123" {} {} {}) 
    => {:r [\1 \2 \3] :s [\1 \2 \3] :i nil :b {:digits [\1 \2 \3]}
        :m {}}

mkpr creates a rule that consumes one item from the input. It then calls the given predicate on it. If the predicate returns nil, the rule fails. Otherwise, the rule passes. The return value is the item consumed. (It does not consume input if the predicate fails.)

Example:

;; match only even numbers
(def even (mkpr even?))

mkret changes the current return value. It takes a function which takes the bindings map and a context.

Example:

;; parse digit characters as an int
(def integer (mkret (mkbind (mk1om (mkpr #(Character/isDigit %))) 
                            :digits) 
                    (fn [b c] (Integer/parseInt (:digits b)))))

mknothing makes a rule return nothing.

Example:

;; ignore whitespace
(def ignorewhitespace (mknothing (mk1om #(Character/isSpace %))))

mkseq takes any number of rules and creates a rule that must match all of them in sequence.

Example:

;; match three even numbers then two odds
(def even3odd2 (mkseq even even even odd odd))

mkalt takes any number of rules and tries each one in order. The first rule that matches is returned.

Example:

;; match 'a' or 'b' or 'c' (in that order)
(def aorborc (mkalt (mklit \a) (mklit \b) (mklit \c)))

mkpred takes a predicate and returns a rule that succeeds when the predicate applied to the first input element succeeds. It never returns a value and never consumes input. The predicate operates on the bindings map.

Example:

;; match two numbers if their sum > 100
(def sum>100 (mkseq (mkbind integer :a) (mkbind integer :b)
                    (mkpred #(> (+ (:a %) (:b %)) 100))))

mkzom creates a rule which matches zero or more times. It will match the most possible times. It never fails.

Example:

;; match zero or more whitespace
(def w* (mkzom (mkpr #(Character/isSpace %))))

mkscope creates a scope protector around a rule so that bindings that the given rule creates do not leak into the current scope. This function should be used around your own rules.

Example:

;; a rule that binds but does not protect
(def as (mkbind (mk1om (mklit \a)) :as))
;; a rule that calls as
(def xab (mkseq (mkbind (mk1om (mklit \x)) :as) ;; bind to :as
                (mkscope as)                    ;; make sure as
                                                ;; does not bind
                (mk1om (mklit \b))))

mksub creates a rule that matches a nested seq within the seq.

Example:

;; match a seq of ones
(def ones (mkzom (mklit 1)))
;; match a seq of ones followed by 2s
(def onesthentwos (mkseq (mksub ones) (mkzom (mklit 2))))
(onesthentwos [[1 1 1] 2 2] {}) => SUCCESS
(onesthentwos [1 1 1 2 2] {}) => FAILURE

mk1om creates a rule that matches the given rule at least once. Returns all matches.

Example:

;; match one or more digits
(def digits (mk1om digit))
(digits "1234" {} {} {}) 
    => {:r [\1 \2 \3 \4] :s [\1 \2 \3\ 4] :i nil :b {} :m {}}
(digits "123 4" {} {} {}) 
    => {:r [\1 \2 \3] :s [\1 \2 \3] :i (\4) :b {} :m {}}

mkopt creates a rule that always succeeds. If the rule it is given matches, it returns its value. Otherwise, it succeeds with no return.

Example:

;; optionally match 'xyz'
(def xyz? (mkopt (mkstr "xyz")))

mklit creates a rule that matches a value if it is equal.

Example:

;; match the number 12
(def twelve (mklit 12))
(twelve [12] {} {} {}) => {:r 12 :s [12] :i nil :b {} :m {}}

mkstr create a sequential rule that matches all of the characters of a string.

Example:

;; match the string "hello" followed by whitespace
(def hellow+ (mkseq [(mkstr "hello") (mk1om whitespace)]))

mkmemo creates a new rule which memoizes the given rule. The best way to use this is directly inside of a mkscope when defining a top-level rule for most efficient results. Memoizing is done to trade space efficiency for time efficiency. Effectively using mkmemo will make a parse use linear space and linear time with respect to input size.

mkmatch create a new rule which returns the matched portion of the input. It binds that portion of the input matched by the given rule to :match. It also coerces Strings if possible.

Example:

;; match a "SELECT" statement in a contrived query language.
;; perform a "lookup" of everything after the SELECT followed
;; by whitespace
(def selectstmt (mkscope (mkmemo (mkret (mkseq [(mkstr "SELECT") w+
   (mkmatch (mk1om anything))]) (fn [b c] (lookup (:match b))))))

###Predefined rules

Here are the predefined rules that may come in handy.

always a utility rule (not a combinator) which always matches.

never a utility rule (not a combinator) which never matches.

anything a utitily rule that consumes one item of input and returns it.

whitespace matches a single character that is a space.

digit matches a single character that is a digit.

###Utility

mkfn creates a function that is a little bit more friendly to call yourself but is still valid as a rule.

When called with a seq of inputs, it calls the given rule with empty bindings and throws a RuntimeException if the parse fails and the :r value if it succeeds.

mkfn functions can also be called with input + context.

When called with input, bindings, context, and memo, it acts as a normal rule.

##Other documents

See RELEASENOTES.md for a history of the project.

See FUTUREPLANS.md to know where the project is headed.