Follow this project by going through each version; each version contains the files from the corresponding old commit of the Elixir project, along with my README that explains what is going on.
Useful resources:
- https://arifishaq.wordpress.com/2014/01/22/playing-with-leex-and-yeec/
- http://blog.rusty.io/2011/02/08/leex-and-yecc/
Important files: elixir_lexer.erl and elixir_parser.yrl
We start with leex. No theory here. There are plenty of resources on the web that explain what a “lexer” does. It will break our string into tokens in a format yecc wants. Each token will say what type of input we are dealing with.
%% leex file structure
Definitions.
Rules.
Erlang code.
We’ll start in the middle. With the rules. They are the heart of it all.
A rule is made up of two parts: a regular expression (a la Erlang) and Erlang code. The code is invoked when the scanned string matches the regular expression and generally returns a token tuple of the form {token, Token}.
The regular expression and the code are separated by whitespace, a colon and more whitespace. The code can call functions implemented in the third section, Erlang code.
The code may return:
- {token, Token} for a normal token
- {end_token, Token} when the returned token is known to be the last token
- skip_token when the token in question is to be ignored
- {error, ErrorDescription} for errors
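To illustrate the less common return values, here are two hypothetical rules. They are not part of the project's lexer; the ';' terminator and the tab rule are invented for this example:

```erlang
%% Hypothetical rules, not in the project's lexer:
%% a ';' marks the end of the input ('$end' is yecc's end-of-input token)
; : {end_token, {'$end', TokenLine}}.
%% runs of tabs produce no token at all
\t+ : skip_token.
```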
Back to the Token. This is a tuple with, generally, three fields: {Category, Location, Content}.
- Category tells us something about the nature of the token: whether it is a number, a string, etc.
- Location specifies where in the input we found the token. This is mostly useful for error reporting, and is required by yecc. The Erlang code can assume there is a variable TokenLine which is bound to the line number.
- Content is what we would like the parser to believe was input as a token. It may be what was actually present in the scanned string, but it could be something we decide. The Erlang code can assume the availability of the variables TokenLen and TokenChars. The former is the length of the matched substring; the latter is the matched substring itself.
There are times when you will want the Category to be the same as the Content. In such a situation your code can also return just {Category, Location}.
While we are at it, let's say a word about the Definitions section. It allows us to define macros which we can use when specifying the regular expressions for the rules.
Definitions take the form Macro = Value. No dot at the end here!
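For example, a hypothetical macro and rule (not in the project; HEX and the 0x rule are invented here just to show the Macro = Value form in use):

```erlang
Definitions.
HEX = [0-9a-fA-F]

Rules.
%% {HEX} expands to the macro's value; tl(tl(...)) drops the "0x" prefix
0x{HEX}+ : {token, {integer, TokenLine, list_to_integer(tl(tl(TokenChars)), 16)}}.
```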
Now let's see what we have in Elixir.
Let's first compile everything by hand and work through an example.
- leex:file('./src/elixir_lexer'). gives leex our lexer definition and generates elixir_lexer.erl
- c("./src/elixir_lexer.erl"). compiles the lexer and exposes the string/1 function
- elixir_lexer:string("2") should return {ok,[{integer,1,2}],1}, which means it is working
What happened now?
"2" was fed to the string function of elixir_lexer and, according to the rules we specified, it spat out tokens that can now be fed to our parser.
Let's play a little before we continue with the parser. Let's see how the lexer file looks now:
elixir_lexer.erl
Definitions.
Digit = [0-9]
Rules.
%% Numbers
{Digit}+\.{Digit}+ : { token, { float, TokenLine, list_to_float(TokenChars) } }.
{Digit}+ : { token, { integer, TokenLine, list_to_integer(TokenChars) } }.
%% Operators
\+ : { token, { '+', TokenLine } }.
- : { token, { '-', TokenLine } }.
\* : { token, { '*', TokenLine } }.
/ : { token, { '/', TokenLine } }.
\( : { token, { '(', TokenLine } }.
\) : { token, { ')', TokenLine } }.
Erlang code.
We see now that, for example, this rule
{Digit}+ : { token, { integer, TokenLine, list_to_integer(TokenChars) } }.
follows the {Category, Location, Content} specification: Category is integer, Location is TokenLine and Content is the result of the Erlang function list_to_integer.
We also see the usage of {Category, Location} in the operators. For example
\+ : { token, { '+', TokenLine } }.
says that the Category and the Content are the same, namely the plus sign (+), and the Location is TokenLine.
- We have a definition of Digit as a regex matching any digit from 0 to 9
- Next we have the rules for numbers. We see one for integer and one for float.
- {Digit}+ is the integer rule and returns { token, { integer, TokenLine, list_to_integer(TokenChars) } }. In our example, elixir_lexer:string("2") gave {ok,[{integer,1,2}],1}. We are interested in the {integer,1,2} part. It conforms with what is written in our lexer: integer is the Category, 1 is the TokenLine and 2 is the result of list_to_integer(TokenChars), an Erlang function that takes a string as input and gives back an integer. (You can test it out in the erl repl: type list_to_integer("12").)
- {Digit}+\.{Digit}+ is the float rule and returns { token, { float, TokenLine, list_to_float(TokenChars) } }. Pretty similar to integer, with the difference that if there is a dot between two runs of digits then it is a float. Using Erlang's list_to_float(TokenChars) we get the float we need. Let's test this out too: elixir_lexer:string("2.1"). gives {ok,[{float,1,2.1}],1}. So the lexer recognized this as a float, returned float instead of integer, and as before gave the TokenLine and the actual number.
- Next we have the operator rules. We won't analyze them all, for brevity, but you can see that the operators are +, -, *, /, ( and ). Take \+ : { token, { '+', TokenLine } }. as an example. It says that when the lexer finds a plus sign (+) it will return { token, { '+', TokenLine } }. We can see that if we type elixir_lexer:string("+"). it returns {ok,[{'+',1}],1}.
You can do more. What if we gave it elixir_lexer:string("11*11+(5-10)/2"). ? The lexer handles it like a champ and gives out {ok,[{integer,1,11},{'*',1},{integer,1,11},{'+',1},{'(',1},{integer,1,5},{'-',1},{integer,1,10},{')',1},{'/',1},{integer,1,2}],1}.
What if we put elixir_lexer:string("11 + 11"). ? This fails miserably and says {error,{1,elixir_lexer,{illegal," "}},1}. The reason is in the result: we have not yet given our lexer a rule for whitespace.
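A sketch of a fix, assuming we simply want to discard whitespace (this rule is not in the project's lexer at this commit):

```erlang
%% Skip spaces, tabs and newlines entirely; skip_token emits nothing
[\s\t\n\r]+ : skip_token.
```

After regenerating and recompiling the lexer, elixir_lexer:string("11 + 11"). should then tokenize without the illegal-character error.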
yecc uses the tokenized representation of the string to parse it and make stuff happen. Right now the elixir lexer handles arithmetic operations; to parse them, yecc has to be instructed in the semantics of a valid arithmetic expression. We provide these instructions to yecc in a yrl file (elixir_parser.yrl).
An empty yrl would look like this:
Nonterminals.
Terminals.
Rootsymbol arithmetic.
Erlang code.
- Nonterminals - these are syntactic elements that are composed of terminal symbols in various patterns;
- Terminals - these are indivisible symbols that form building blocks for composed structures. These are the tokens supplied by the lexer;
- Rootsymbol - this is the root of the syntax tree, i.e. the contextual starting point for interpreting the input symbols;
- Rules - these tell the parser how to classify and structure the tokens it is consuming. A rule has a name, the pattern it matches and the action to be taken when the rule fires. These return the final data structures to the application, so this is the point where you can perform any contextual transformations that are necessary;
- Erlang code - as with the lexer grammar, this is where you can define functions to abbreviate the actions defined in the rules.
Let's split the actual elixir_parser.yrl file to see what is going on.
Terminals
float integer
'+' '-' '*' '/' '(' ')'
.
These are what the lexer spits out. We see all the values we got when analyzing the lexer. All of these are needed to define arithmetic operations.
Nonterminals
arithmetic
add_op
mult_op
number
.
A grammar rule says how to derive a nonterminal from a sequence of terminals and/or nonterminals. Erlang code may optionally be associated with the rule, and this code may make use of functions in the last section of the grammar file, the Erlang code section.
The general form of a grammar rule is:
Nonterminal -> sequence of terminals and/or nonterminals : associated Erlang code.
Let's look at a couple of grammar rules:
number -> float : '$1'.
number -> integer : '$1'.
These define what a number nonterminal is for the parser: it can be either a float or an integer, and it yields '$1'. But what is '$1'?
yecc provides the atoms '$1', '$2', etc., which can be used in the Erlang code associated with a grammar rule. These atoms represent the actual token associated with each symbol in the rule: '$1' is the token associated with the first symbol, '$2' the one associated with the second symbol, and so on. The token, as we remember from the leex part, looks like {Category, Location, Content}. This is why number simply returns '$1': the whole token, e.g. {integer, 1, 2}, is kept as the value of the nonterminal.
Pretty much the same for the operators
%% Addition operators
add_op -> '+' : '$1'.
add_op -> '-' : '$1'.
%% Multiplication operators
mult_op -> '*' : '$1'.
mult_op -> '/' : '$1'.
We have an add_op nonterminal if we see a '+' or a '-' terminal, and a mult_op nonterminal if we see a '*' or a '/' terminal.
Now let's see the interesting stuff that combines our nonterminals to create the semantics of our arithmetic (nonterminal) operations:
arithmetic -> arithmetic add_op arithmetic :
{ binary_op, ?line('$1'), ?op('$2'), '$1', '$3' }.
arithmetic -> arithmetic mult_op arithmetic :
{ binary_op, ?line('$1'), ?op('$2'), '$1', '$3' }.
arithmetic -> '(' arithmetic ')' : '$2'.
arithmetic -> number : '$1'.
Let's start with the easy ones.
- arithmetic -> number : '$1'. states that if we have a number then we also have an arithmetic expression, whose value is the actual number token.
- arithmetic -> '(' arithmetic ')' : '$2'. states that if we have an arithmetic nonterminal between ( and ) terminals, then we also have an arithmetic expression, and its value is the second symbol. If we write elixir_lexer:string("(10)"). then what we get from the lexer is {ok,[{'(',1},{integer,1,10},{')',1}],1}, and '$2' is {integer,1,10}.
- arithmetic -> arithmetic add_op arithmetic : { binary_op, ?line('$1'), ?op('$2'), '$1', '$3' }. states that if you see something like 2 + 2 then there is a binary operation at the line of the first symbol, with the operator of the second symbol, and the operands of the addition are '$1' and '$3', i.e. the first and third symbols.
- The mult_op rule works the same way as the add_op one.
The ?line and ?op macros are utilities defined in the Erlang code section at the end of the file:
Erlang code.
-define(op(Node), element(1, Node)).
-define(line(Node), element(2, Node)).
They use Erlang's element function to take the appropriate value from the tuple. For example, if the expression was 2+2, ?line('$1') is like saying ?line({integer, 1, 2}), which translates to element(2, {integer, 1, 2}) and returns 1 as the line number, the second element of the tuple.
Last but not least there is a declaration of operator precedence:
Left 100 add_op.
Left 200 mult_op.
which states that mult_op binds tighter than add_op (it should be evaluated first).
Now let's see it in action!
Let's compile the parser:
- yecc:file("./src/elixir_parser.yrl").
- c("./src/elixir_parser.erl"). compiles it and exposes the parse/1 function, which takes the lexer's output
elixir_parser:parse(element(2,elixir_lexer:string("1+2"))). returns {ok,{binary_op,1,'+',{integer,1,1},{integer,1,2}}}. All makes perfect sense!
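The precedence declarations show up in the shape of the tree. A sketch of what a mixed expression should give (the output is inferred from the grammar and precedence rules, not pasted from a session):

```erlang
%% '*' (Left 200) binds tighter than '+' (Left 100), so 2*3 nests
%% under the addition:
elixir_parser:parse(element(2, elixir_lexer:string("1+2*3"))).
%% should return:
%% {ok,{binary_op,1,'+',{integer,1,1},
%%                      {binary_op,1,'*',{integer,1,2},{integer,1,3}}}}
```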
We used element in the above example to get only the tokens from the lexer and feed them immediately to the parser. Because if you remember the lexer for elixir_lexer:string("1+2") returns {ok,[{integer,1,1},{'+',1},{integer,1,2}],1} and we only want [{integer,1,1},{'+',1},{integer,1,2}].
Finally, there is one more file, elixir.erl, that exposes a parse/1 function and does what we did with the command elixir_parser:parse(element(2,elixir_lexer:string("1+2"))). so that we can execute elixir code more easily.
-module(elixir).
-export([parse/1]).
parse(String) ->
{ok, Tokens, _} = elixir_lexer:string(String),
{ok, ParseTree} = elixir_parser:parse(Tokens),
ParseTree.
To use it you just need to compile the elixir.erl file:
- c("./src/elixir.erl").
- elixir:parse("1+2"). and voilà!
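Putting it together: given the wrapper above, the tree comes back without the ok wrapper, since parse/1 pattern-matches it away (expected output derived from the earlier parser result, not a pasted session):

```erlang
elixir:parse("1+2").
%% should return {binary_op,1,'+',{integer,1,1},{integer,1,2}}
```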
That concludes this commit. So run make compile and we are done. Then go to a terminal and type erl to get into the Erlang repl.
- Date: May 24, 2012