Please read: Since this project started, a lot of lexing/parsing libraries have emerged, most of them doing the job better than lexington. (Have a look at instaparse and be amazed.) I will thus not continue working on this library, except to fix issues that may arise.
lexington is aimed at simplifying the creation of extensible and combinable lexers. Written in Clojure, it offers a customizable infrastructure and (so far) some predefined helper utilities. It is still a work in progress, but I hope it provides at least a little bit of usefulness - and if not that, perhaps a light chuckle?
Include via Leiningen:
[lexington "0.1.1"]
A lexer is just a function consuming a sequence of input entities (e.g. characters) and producing a sequence of tokens derived from that input. Tokens are simple Clojure maps with three mandatory fields:
(lexington.tokens/new-token :string (seq "Hello"))
; =>
{ :lexington.tokens/type   :string
  :lexington.tokens/data   (\H \e \l \l \o)
  :lexington.tokens/length 5 }
So, one could build a lexer manually, examining the input sequence by hand and creating output tokens with new-token when needed. Or one can use the lexer macro (and its def counterpart deflexer), which associates token types with matching instructions:
(ns test
  (:use lexington.lexer
        lexington.utils.lexer))

(deflexer simple-math*
  :integer #"[1-9][0-9]*"
  :plus    "+"
  :minus   "-")

; e.g. (simple-math* "1+2+3")
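Given the token format shown earlier, tokenizing a small input should yield a sequence of token maps along these lines (a sketch inferred from the token structure above, not verified output):

; (simple-math* "1+2")
; =>
; ({ :lexington.tokens/type :integer :lexington.tokens/data (\1) :lexington.tokens/length 1 }
;  { :lexington.tokens/type :plus    :lexington.tokens/data (\+) :lexington.tokens/length 1 }
;  { :lexington.tokens/type :integer :lexington.tokens/data (\2) :lexington.tokens/length 1 })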
Matching instructions include:
- regular expressions
- strings
- matcher functions (returning the number of input entities they match)

This list can be extended by implementing the protocol lexington.seq-matchers/Matchable. A sketch of a hand-written matcher function follows below.
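For illustration, here is a minimal matcher function. It assumes (per the description above) that a matcher receives the remaining input sequence and returns the number of matching entities, presumably with 0 signalling no match; the names here are hypothetical and not part of lexington's API:

; A sketch of a custom matcher: counts how many leading input
; entities are decimal digits. Returns 0 when nothing matches.
(defn digit-run [input]
  (count (take-while #(Character/isDigit %) input)))

; Hypothetical usage inside a lexer definition:
; (deflexer my-lexer
;   :digits digit-run)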
Lexers can include and extend other lexers:
(deflexer simple-math-ws*
  (:include simple-math*)
  :ws #" |\t|\r|\n")

; e.g. (simple-math-ws* "1 + 2 + 3")
One problematic detail: to match a regular expression, the lexer currently has to realize the whole input sequence. Avoiding this, and thus taking full advantage of lazy sequences, should probably be within the scope of this project.
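To illustrate why regex matching forces realization (this is a sketch of the underlying problem, not lexington's actual implementation): java.util.regex operates on a CharSequence, so a lazy sequence of characters has to be poured into a string before matching can even begin.

; Determining how many leading characters a regex matches forces
; the whole lazy sequence into a string first.
(defn regex-match-length [re input]
  (let [m (re-matcher re (apply str input))] ; realizes the entire input
    (if (.lookingAt m)
      (.end m)
      0)))

; (regex-match-length #"[1-9][0-9]*" (seq "123+4")) ; => 3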
So far the lexer is dumb: it produces token after token until it reaches a point it can't cope with. But, for example, we don't need the whitespace tokens in our result, so we should probably remove them; additionally, it would be nice to have the actual integer value of our :integer tokens in the resulting token map. We can do this:
(def simple-math
  (-> simple-math-ws*
      (discard :ws)
      (with-string :str :only [:integer])
      (generate-for :integer :int #(Integer/parseInt (:str %)))))
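Assuming the pipeline behaves as described (discard drops the :ws tokens, with-string adds a :str entry, and generate-for derives :int from it; this is an inference from the code above, and whether the added keys are plain or namespaced is an assumption), a run might look like:

; (simple-math "1 + 2")
; =>
; ({ :lexington.tokens/type :integer ... :str "1" :int 1 }
;  { :lexington.tokens/type :plus ... }
;  { :lexington.tokens/type :integer ... :str "2" :int 2 })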
Have a look at the lexington.utils.lexer namespace for more possibilities and insight.
Since Clojure supports the generation of Clojure code at compile time (via macros), it might be desirable to have some kind of grammar DSL and a means to transform it into parser code, without the hassle of the usual "edit grammar"-"regenerate code"-"compile it" cycle. This is actually what got this project started, since a parser without a usable lexer is only half the fun. Where this aspect of the project goes remains to be seen.
You can generate Marginalia documentation for this project by adding lein-marginalia to your :user profile in ~/.lein/profiles.clj, e.g.:

{ :user { :plugins [ ... [lein-marginalia "0.7.1"] ... ] } }
Then call:

lein marg -d doc/html

and direct your browser to the file doc/html/uberdoc.html.