C Compiler in Python
Group Members: Charles Hermann, Izeah Olguin, Paco Coursey
Last Updated: 04/28/2019
Quickstart
Clone the repository:
$ git clone https://github.com/pacocoursey/compiler.git
Install the dependencies (requires pip3):
$ make install
Overview
Our python compiler currently meets the standards expected of the third and final deliverable.
Our scanner reads in a file as a string of characters and produces a list of labeled tokens. These tokens are acquired by the parser and are used to produce a parse tree. The parse tree is used to construct a symbol table. Then the parse tree and symbol table are used to generate an intermediate representation of the program.
The parse tree is created using action and goto tables generated from our grammar rules. We use an LR(1) shift reduce parser. We then flatten out the recursive grammar rules of the parse tree using a recursive depth first traversal to remove redundant tree nodes.
The symbol table is created by scanning the parse tree in depth first order and creating new scopes for each function declaration. We error check for undefined identifiers and duplicate declarations here.
Our intermediate representation is also generated by a depth first traversal of the parse tree. We call an IR method for each node of the tree and save the result as a list of strings. We use three-address code in combination with temporary variables to represent the program.
The assembly code is generated by transforming each intermediate representation instruction into its relevant assembly instruction. We store all variables on the stack and only use registers for temporary storage or instructions requiring their use.
As a group, we have met numerous times to discuss our design, implementation and overall expectations of our compiler.
Usage
Our compiler uses Python 3.
Character Count
This program returns the number of characters found in a given file. Run using:
$ python3 -m src.character_count FILENAME
Compiler Flags
Run the compiler without any flags to complete all stages of the compiler and generate a .s
file (equivalent to gcc -O0 -S filename
):
$ python3 -m src.main FILENAME
# compare with:
$ gcc -O0 -S FILENAME
-h
or --help
Print usage information.
-v
or --verbose
Logs additional output to a file in /logs
.
-f
or --force
Forcefully generate new action and goto tables from the specified grammar rules and save them in /tables
.
-g
or --grammar
Specify a grammar to use. Defaults to /grammars/main_grammar.txt
. Run using:
$ python3 -m src.main -g GRAMMAR_FILENAME FILENAME
# or
$ python3 -m src.main --grammar GRAMMAR_FILENAME FILENAME
-s
or --scan
Tokenizes a C program and returns a list of the known tokens. Run using:
$ python3 -m src.main -s FILENAME
# or
$ python3 -m src.main --scan FILENAME
-p
or --parse
Converts a list of tokens generated by the scanner and constructs an abstract representation of the program using a pre-defined C grammar. Run using:
$ python3 -m src.main -p FILENAME
# or
$ python3 -m src.main --parse FILENAME
-t
or --table
Generate a symbol table using the parse tree to keep a list of scopes and defined variables. Run using:
$ python3 src/main -t FILENAME
# or
$ python3 src/main --table FILENAME
-r
or --ir
Generate an Intermediate Representation. Run using:
$ python3 -m src.main -r FILENAME
# or
$ python3 -m src.main --ir FILENAME
-i
or --input
Read in an IR instead of source file. Run using:
$ python3 -m src.main -i INPUT_IR
# or
$ python3 -m src.main --input INPUT_IR
-o
or --output
Write out the IR to a file. Run using:
$ python3 -m src.main -r -o OUTPUT_FILENAME FILENAME
# or
$ python3 -m src.main --ir --output OUTPUT_FILENAME FILENAME
-a
or --asm
Generate assembly instructions from the IR. Run using:
$ python3 -m src.main -a FILENAME
# or
$ python3 -m src.main --asm FILENAME
-n
or --asmOutput
Write out the assembly to a file. Run using:
$ python3 -m src.main -a -n OUTPUT_FILENAME FILENAME
# or
$ python3 -m src.main --asm --asmOutput OUTPUT_FILENAME FILENAME
Design Discussion
Scanner Implementation
Our scanner parses characters in a 'chunk', with a start an and end counter. We iterate through the chunk to match our tokens in the following order: Symbols, Operators, Keywords, Identifiers and Numbers.
We are not optimizing for speed, which is why we opted not to use regular expressions. This gives us more logical control over what tokens we recognize, and it is easier to handle edge cases like multi-line comments.
Parser Implementation
Our parser uses action and goto tables generated from the rules in grammars/main_grammar.txt
. Because generating the tables takes so long, after the first generation they are saved in JSON format in the tables/
directory for future compiler executions. The parser outputs a parse tree consisting of instances of custom node classes defined in grammar.py
. The parse tree nodes are highly abstracted and do not include unimportant tokens like brackets or parentheses.
After the initial creating of the parse tree, it is "flattened" by un-nesting recursive grammar nodes. This makes it easier to generate the symbol table and removes useless duplicate nodes from the tree.
Symbol Table Implementation
Our symbol table uses the parse tree to create a new scope for each function declaration. We save each variable declaration inside the appropriate scope, including a global scope. While generating the symbol table we also check for duplicate variable, function declaration and undefined identifiers.
The symbol table also keeps track of goto labels and checks for invalid code that calls goto labels that are never declared. This is done differently to undefined identifier and function checking, as goto labels can be declared anywhere in the program.
IR Implementation
The intermediate representation is generated by visiting each node of the parse tree in depth first order and calling an IR method on each node. These IR methods are defined differently for each node in grammar.py
. The IR methods currently return strings, and are saved in a simple list.
We use three-address code and intermediate variables to assist with our IR generation. The intermediate variables are uniquely generated so that we can avoid unclear assignments.
Our compiler can skip all of the above steps and start from an already generated IR file by using the -i
or --input
flags. You can dump the intermediate representation of a program to a file using the -o
or --output
flags.
ASM Implementation
The assembly code is generated by converting each line of the intermediate representation to equivalent assembly instructions. We keep all variables on the stack and assume all registers to be strictly temporary. There are no race conditions, however, as each IR instruction is parsed and converted as an atomic operation. Parameters to function calls are stored in the registers %r8d
through %r15d
. This means we can support function calls that have up to 8 parameters, but not more than that. Those registers are also temporary, and cannot be relied upon outside of the logic surrounding a callq
statement.
At the start of every function, we grow the stack by 4 * number of local variables
. This ensures there is no memory address overlap when calling other functions. Each function is declared in the ASM as the name given in the C code with an underscore prepended. The only exception is on Linux, where we leave the main function as main
instead of _main
to prevent errors.
As an example, the IR instruction i = 10 / 2
would be translated into the assembly instruction movl $5, -4(%rbp)
(where the variable i
is stored at -4(%rbp)
). As you can see in that example, we do a small optimization and pre-calculate any simple expressions in which both operands are numbers.
Design Benefits
- Few files, things can be modified quickly
- Design and hierarchy is well established
- Good separation of concepts into modularized files
- Able to iterate quickly on all aspects of the compiler
Limitations
- Does not support function calls with more than 8 parameters
- Cannot nest function calls i.e.
sum(sum(2, 3), 5)
- Switch cases must be wrapped in curly braces
{}
- Math expressions must be parenthesized to be evaluated correctly
- Grammar does not support chained variable declarations like
int i, j
Grammar Specification
Required Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Identifiers | ✔ | ✔ | ✔ | ✔ |
Variables | ✔ | ✔ | ✔ | ✔ |
Functions | ✔ | ✔ | ✔ | ✔ |
Keywords | ✔ | ✔ | ✔ | ✔ |
Arithmetic Expressions | ✔ | ✔ | ✔ | ✔ |
Assignment | ✔ | ✔ | ✔ | ✔ |
Boolean Expressions | ✔ | ✔ | ✔ | ✔ |
Goto Statements | ✔ | ✔ | ✔ | ✔ |
If/Else Control Flow | ✔ | ✔ | ✔ | ✔ |
Unary Operators | ✔ | ✔ | ✔ | ✔ |
Return Statements | ✔ | ✔ | ✔ | ✔ |
Break Statements | ✔ | ✔ | ✔ | ✔ |
While Loops | ✔ | ✔ | ✔ | ✔ |
Optional Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Floats | ✔ | ✔ | ✔ | |
++, --, -=, +=, *=, /= |
✔ | ✔ | ✔ | ✔ |
For Loops | ✔ | ✔ | ✔ | |
&, |, ^ Operators |
✔ | ✔ | ✔ | ✔ |
<<, >>, ~ Operators |
✔ | ✔ | ✔ | ✔ |
Switch Statements | ✔ | ✔ | ✔ | ✔ |
Extra Features | Scanner | Parser | IR | ASM |
---|---|---|---|---|
Pointers | ||||
Arrays | ||||
Strings | ✔ | ✔ | ||
Include Statements | ✔ | ✔ | ||
Struct | ✔ | ✔ | ||
Enum | ✔ | ✔ | ||
Casting | ||||
Type Promotion | ||||
Type Specs |
Code Style
Formatted using Black. Linted using PyLint.