A lexical analyzer (or tokenizer) converts the raw sequence of characters from the code into sequence of tokens. In order to do that, the lexical analyzer scans through the sequence of characters in the code, groups them together into lexemes, and identifies the token class for each lexeme. This project implements a basic lexical analyzer (implemented in C++) that scans streams of C code.
- Install
cmake
from here, or runsudo apt-get install cmake
in your terminal/shell. - Run the
build.sh
script as follows:bash build.sh
.
Lexical Analyzer is implemented in src/Lexer.cpp
. Below are the token classes implemented.
- Note: Template in
src/Lexer.cpp
was originally provided by Saikat Chakraborty on 9/11/2020. Functions stateTransition() and tokenizeCode() written by Carmen Wu - Keywords: any tokens from the list [if/else, for, while, extern, asm]
- Token class: KEYWORD
- Identifiers: any tokens that begin with an alphabetic (including both capital and lowercase) character or an underscore (_), followed by at most 16 alphanumeric characters and/or underscore (EXCEPT for the keyword tokens)
- Examples of valid identifiers: test, test1, _id1, and test_1_id_2
- Token class: ID
- Numbers: any numerical tokens optionally containing a decimal point/period (.), i.e., both integers and floating-point numbers
- Examples of valid numbers: 1, 1.0, 1.01, and .01
- Token class: NUMBER
- Strings: any tokens represented by a sequence of characters (including the empty sequence) that begins and ends with double quotes ("). You are not required to handle escape characters like
\"
.- Examples of strings: "Hello", "", and "1.01"
- Token class: STRING
- Comments : any tokens represented by a sequence of characters beginning with a double slash (//) and that ends with a newline (\n)
- Examples of comments: //Hello\n, //""\n, and //"1.01"\n
- Token class: COMMENT
- Operators : any tokens represented by a unary operator, binary operator, or an assignment operator, specifically from the list [_++, --, >, <, >=, <=, ==, =_]
- Token classes: UNOP, BINOP, ASSIGNOP
- Parentheses, Braces, and Semicolons : any tokens from the list [_(, ), {, }, ;_]
- Token classes: LPAR, RPAR, LBRACE, RBRACE, SEMICOLON
Input:
if(n > 0) {
print("Hello World");
}
Expected Output:
<KEYWORD, if>
<LPAR, (>
<ID, n>
<BINOP, >>
<NUMBER, 0>
<RPAR, )>
<L-CURLY-BRACE, {>
<ID, print>
<LPAR, (>
<STRING, "Hello World">
<RPAR, )>
<SEMICOLON, ;>
<R-CURLY-BRACE, }>
Note: Print such representation of token sequences using printTokenStream
function in src/Main.cpp
file.