Hand-Coded Lexer

A simple java implementation of a Lexer in pure Java, made for the course "Compiler" at University of Salerno.

Build

Requirements

Java 11
Maven

IntelliJ Configuratiom

This project is provided with an IntelliJ Configuration:

To run it just import it as a Maven project and click Run

Maven Build

Alternatively you can compile manually typing:

mvn package

The jar file will be placed under the target/ directory.
To run the Tester main function, type:

java -cp target/Lexer-Crisci-Cuccurullo-1.0-SNAPSHOT.jar lexer.com.compiler.Tester FILEPATH

Structure

In this section it will be described which symbols the Lexer object can recognize and its class structure.

Lexemes

This Lexer implementation recognizes the following Tokens:

Delimiters
Keywords
Identifiers
Literals
Separators (like brackets, colon etc.)
Operators

Regular definitions

In the following one can find the regular definition of each Token. Each component of the Lexical Analyzer behaves just as the respective Finite State Machines provided after every regular definition.

Identifiers: ([a-zA-Z])+[a-zA-Z0-9]*

ID Finite State Machine
Literals:
- INT: (([1-9]+[0-9]*)|0)
- FLOAT: INT(.[0-9]+)
LIT Finite State Machine
Strings:
- STR: "[^"]"
STR Finite State Machine
Operators:
- Assignment:
  - AS: <--
Binary operators
- GT: >
- GEQ: >=
- LT: <
- LEQ: <=
- EQ: =
- DIF: !=
Arithmetic operators
- PLUS: +
- SUB: -
- MUL: *
- DIV: /
- MOD: %
- POW: ^
Logical operators
- AND: &&
- OR: ||
- NOT: !
OP Finite State Machine
Separators:
- L_PAR: (
- R_PAR: )
- L_CUR: {
- R_CUR: }
- COM: ,
- S_COM: ;
SEP Finite State Machine

Class Structure

In accord with a classic Lexer structure, a Lexer class has been defined. It wraps a ByteBuffer object referring to the source code target file.
The only exposed method is getToken(), which reads enough characters from the buffer in order to recognize a Token object and return it to the caller.

Every lexeme in this example language is identified by a LexemeAnalyzer object. Each one of these is a specialization of the AbstractLexemeAnalyzer class. An analyzer exposes a method check(ByteBuffer), which detects whether (or not) the next characters read in the buffer correspond to the lexeme that it represents. It returns a RecognizedToken object, which is a data structure composed of a Token and an int numCharRead representing the lexeme lenght.

The value numCharRead is used to define the priorities among the tokens: when a literal matches 2 or more patterns, the longest one is chosen. When 2 or more recognized tokens have the same lenght, the most significant Token in the language specification in chosen.

The Lexer object uses a support structure StringTable, which is a simple redenomination for a HashMap<Integer,String>. This structure holds a single copy for each identifier string found during the analysis phase, whose position in the structure is saved in the attribute field of Token object.

For example, if a string "variable" is found and saved in the StringTable at position 0, a Token <ID,0> will be returned to the caller.

Recognition technique

The recognition phase is "greedy": the analyzer reads the characters from the buffer and returns the correct token found before the error.

For example, if the string "053" is analyzed by the NumberLexemeAnalyzer, it will return the token <INT,0> because "0" is the correct token recognized before reading the character "5" (an INT cannot start with 0, except if it is exactly the number 0)

If no token were recognized before entering an error state (for example, when the first character read does not match any accepted pattern), an ERROR token is returned to the caller.

Two causes of failure have been dealt with during the lexical analysis:

firstly, the character sequence that has been analyzed does not match any of the patterns, in which case an ERROR Token object is returned to the caller;
secondly, at the very end of the analysis, the Buffer object on which the Lexer is operating will be empty and a static EMPTY TOKEN object is returned to the caller. This will prevent infinite loops and the whole process will terminate.

Clarification about REGEX

While the assignment paper for this work explicitly banned Regex for the token recognition, there are section of this program that use Regex.
Nevertheless, the usage of Regex is this program is limited to a simple group of characters recognition phase, in which it is checked if a character belongs to said set.

For example, to check if the character 'a' belongs to the set [a-zA-Z] (all the letters).

Thus the usage of regex does not have impact the original FSM-based recognition technique required.

Authors

Luigi Crisci
Alessio Cuccurullo

Luigi-Crisci/Hand-Coded-Lexer