A simple java implementation of a Lexer in pure Java, made for the course "Compiler" at University of Salerno.
- Java 11
- Maven
This project is provided with an IntelliJ Configuration:
To run it just import it as a Maven project and click Run
Alternatively you can compile manually typing:
mvn package
The jar file will be placed under the target/ directory.
To run the Tester
main function, type:
java -cp target/Lexer-Crisci-Cuccurullo-1.0-SNAPSHOT.jar lexer.com.compiler.Tester FILEPATH
In this section it will be described which symbols the Lexer object can recognize and its class structure.
This Lexer implementation recognizes the following Tokens:
- Delimiters
- Keywords
- Identifiers
- Literals
- Separators (like brackets, colon etc.)
- Operators
In the following one can find the regular definition of each Token. Each component of the Lexical Analyzer behaves just as the respective Finite State Machines provided after every regular definition.
-
Identifiers: ([a-zA-Z])+[a-zA-Z0-9]*
-
Literals:
- INT: (([1-9]+[0-9]*)|0)
- FLOAT: INT(.[0-9]+)
-
Strings:
- STR: "[^"]"
-
Operators:
- Assignment:
- AS: <--
- Assignment:
-
Binary operators
- GT: >
- GEQ: >=
- LT: <
- LEQ: <=
- EQ: =
- DIF: !=
-
Arithmetic operators
- PLUS: +
- SUB: -
- MUL: *
- DIV: /
- MOD: %
- POW: ^
-
Logical operators
- AND: &&
- OR: ||
- NOT: !
-
Separators:
- L_PAR: (
- R_PAR: )
- L_CUR: {
- R_CUR: }
- COM: ,
- S_COM: ;
In accord with a classic Lexer structure, a Lexer
class has been defined. It wraps a ByteBuffer
object referring to the source code target file.
The only exposed method is getToken()
, which reads enough characters from the buffer in order to recognize a Token
object and return it to the caller.
Every lexeme in this example language is identified by a LexemeAnalyzer
object. Each one of these is a specialization of the AbstractLexemeAnalyzer
class. An analyzer exposes a method check(ByteBuffer)
, which detects whether (or not) the next characters read in the buffer correspond to the lexeme that it represents.
It returns a RecognizedToken
object, which is a data structure composed of a Token
and an int numCharRead
representing the lexeme lenght.
The value
numCharRead
is used to define the priorities among the tokens: when a literal matches 2 or more patterns, the longest one is chosen. When 2 or more recognized tokens have the same lenght, the most significant Token in the language specification in chosen.
The Lexer
object uses a support structure StringTable
, which is a simple redenomination for a HashMap<Integer,String>
. This structure holds a single copy for each identifier string found during the analysis phase, whose position in the structure is saved in the attribute
field of Token
object.
For example, if a string "variable" is found and saved in the
StringTable
at position 0, aToken
<ID,0> will be returned to the caller.
The recognition phase is "greedy": the analyzer reads the characters from the buffer and returns the correct token found before the error.
For example, if the string "053" is analyzed by the NumberLexemeAnalyzer, it will return the token <INT,0> because "0" is the correct token recognized before reading the character "5" (an INT cannot start with 0, except if it is exactly the number 0)
If no token were recognized before entering an error state (for example, when the first character read does not match any accepted pattern), an ERROR
token is returned to the caller.
Two causes of failure have been dealt with during the lexical analysis:
- firstly, the character sequence that has been analyzed does not match any of the patterns, in which case an
ERROR
Token object is returned to the caller; - secondly, at the very end of the analysis, the Buffer object on which the Lexer is operating will be empty and a static
EMPTY TOKEN
object is returned to the caller. This will prevent infinite loops and the whole process will terminate.
While the assignment paper for this work explicitly banned Regex for the token recognition, there are section of this program that use Regex.
Nevertheless, the usage of Regex is this program is limited to a simple group of characters recognition phase, in which it is checked if a character belongs to said set.
For example, to check if the character 'a' belongs to the set [a-zA-Z] (all the letters).
Thus the usage of regex does not have impact the original FSM-based recognition technique required.
- Luigi Crisci
- Alessio Cuccurullo