C Compiler in Python

Group Members: Charles Hermann, Izeah Olguin, Paco Coursey

Last Updated: 04/28/2019

Quickstart

Clone the repository:

$ git clone https://github.com/pacocoursey/compiler.git

Install the dependencies (requires pip3):

$ make install

Overview

Our python compiler currently meets the standards expected of the third and final deliverable.

Our scanner reads in a file as a string of characters and produces a list of labeled tokens. These tokens are acquired by the parser and are used to produce a parse tree. The parse tree is used to construct a symbol table. Then the parse tree and symbol table are used to generate an intermediate representation of the program.

The parse tree is created using action and goto tables generated from our grammar rules. We use an LR(1) shift reduce parser. We then flatten out the recursive grammar rules of the parse tree using a recursive depth first traversal to remove redundant tree nodes.

The symbol table is created by scanning the parse tree in depth first order and creating new scopes for each function declaration. We error check for undefined identifiers and duplicate declarations here.

Our intermediate representation is also generated by a depth first traversal of the parse tree. We call an IR method for each node of the tree and save the result as a list of strings. We use three-address code in combination with temporary variables to represent the program.

The assembly code is generated by transforming each intermediate representation instruction into its relevant assembly instruction. We store all variables on the stack and only use registers for temporary storage or instructions requiring their use.

As a group, we have met numerous times to discuss our design, implementation and overall expectations of our compiler.

Usage

Our compiler uses Python 3.

Character Count

This program returns the number of characters found in a given file. Run using:

$ python3 -m src.character_count FILENAME

Compiler Flags

Run the compiler without any flags to complete all stages of the compiler and generate a .s file (equivalent to gcc -O0 -S filename):

$ python3 -m src.main FILENAME
# compare with:
$ gcc -O0 -S FILENAME

`-h` or `--help`

Print usage information.

`-v` or `--verbose`

Logs additional output to a file in /logs.

`-f` or `--force`

Forcefully generate new action and goto tables from the specified grammar rules and save them in /tables.

`-g` or `--grammar`

Specify a grammar to use. Defaults to /grammars/main_grammar.txt. Run using:

$ python3 -m src.main -g GRAMMAR_FILENAME FILENAME
# or
$ python3 -m src.main --grammar GRAMMAR_FILENAME FILENAME

`-s` or `--scan`

Tokenizes a C program and returns a list of the known tokens. Run using:

$ python3 -m src.main -s FILENAME
# or
$ python3 -m src.main --scan FILENAME

`-p` or `--parse`

Converts a list of tokens generated by the scanner and constructs an abstract representation of the program using a pre-defined C grammar. Run using:

$ python3 -m src.main -p FILENAME
# or
$ python3 -m src.main --parse FILENAME

`-t` or `--table`

Generate a symbol table using the parse tree to keep a list of scopes and defined variables. Run using:

$ python3 src/main -t FILENAME
# or
$ python3 src/main --table FILENAME

`-r` or `--ir`

Generate an Intermediate Representation. Run using:

$ python3 -m src.main -r FILENAME
# or
$ python3 -m src.main --ir FILENAME

`-i` or `--input`

Read in an IR instead of source file. Run using:

$ python3 -m src.main -i INPUT_IR
# or
$ python3 -m src.main --input INPUT_IR

`-o` or `--output`

Write out the IR to a file. Run using:

$ python3 -m src.main -r -o OUTPUT_FILENAME FILENAME
# or
$ python3 -m src.main --ir --output OUTPUT_FILENAME FILENAME

`-a` or `--asm`

Generate assembly instructions from the IR. Run using:

$ python3 -m src.main -a FILENAME
# or
$ python3 -m src.main --asm FILENAME

`-n` or `--asmOutput`

Write out the assembly to a file. Run using:

$ python3 -m src.main -a -n OUTPUT_FILENAME FILENAME
# or
$ python3 -m src.main --asm --asmOutput OUTPUT_FILENAME FILENAME

Design Discussion

Scanner Implementation

Our scanner parses characters in a 'chunk', with a start an and end counter. We iterate through the chunk to match our tokens in the following order: Symbols, Operators, Keywords, Identifiers and Numbers.

We are not optimizing for speed, which is why we opted not to use regular expressions. This gives us more logical control over what tokens we recognize, and it is easier to handle edge cases like multi-line comments.

Parser Implementation

Our parser uses action and goto tables generated from the rules in grammars/main_grammar.txt. Because generating the tables takes so long, after the first generation they are saved in JSON format in the tables/ directory for future compiler executions. The parser outputs a parse tree consisting of instances of custom node classes defined in grammar.py. The parse tree nodes are highly abstracted and do not include unimportant tokens like brackets or parentheses.

After the initial creating of the parse tree, it is "flattened" by un-nesting recursive grammar nodes. This makes it easier to generate the symbol table and removes useless duplicate nodes from the tree.

Symbol Table Implementation

Our symbol table uses the parse tree to create a new scope for each function declaration. We save each variable declaration inside the appropriate scope, including a global scope. While generating the symbol table we also check for duplicate variable, function declaration and undefined identifiers.

The symbol table also keeps track of goto labels and checks for invalid code that calls goto labels that are never declared. This is done differently to undefined identifier and function checking, as goto labels can be declared anywhere in the program.

IR Implementation

The intermediate representation is generated by visiting each node of the parse tree in depth first order and calling an IR method on each node. These IR methods are defined differently for each node in grammar.py. The IR methods currently return strings, and are saved in a simple list.

We use three-address code and intermediate variables to assist with our IR generation. The intermediate variables are uniquely generated so that we can avoid unclear assignments.

Our compiler can skip all of the above steps and start from an already generated IR file by using the -i or --input flags. You can dump the intermediate representation of a program to a file using the -o or --output flags.

ASM Implementation

The assembly code is generated by converting each line of the intermediate representation to equivalent assembly instructions. We keep all variables on the stack and assume all registers to be strictly temporary. There are no race conditions, however, as each IR instruction is parsed and converted as an atomic operation. Parameters to function calls are stored in the registers %r8d through %r15d. This means we can support function calls that have up to 8 parameters, but not more than that. Those registers are also temporary, and cannot be relied upon outside of the logic surrounding a callq statement.

At the start of every function, we grow the stack by 4 * number of local variables. This ensures there is no memory address overlap when calling other functions. Each function is declared in the ASM as the name given in the C code with an underscore prepended. The only exception is on Linux, where we leave the main function as main instead of _main to prevent errors.

As an example, the IR instruction i = 10 / 2 would be translated into the assembly instruction movl $5, -4(%rbp) (where the variable i is stored at -4(%rbp)). As you can see in that example, we do a small optimization and pre-calculate any simple expressions in which both operands are numbers.

Design Benefits

Few files, things can be modified quickly
Design and hierarchy is well established
Good separation of concepts into modularized files
Able to iterate quickly on all aspects of the compiler

Limitations

Does not support function calls with more than 8 parameters
Cannot nest function calls i.e. sum(sum(2, 3), 5)
Switch cases must be wrapped in curly braces {}
Math expressions must be parenthesized to be evaluated correctly
Grammar does not support chained variable declarations like int i, j

Grammar Specification

Required Features	Scanner	Parser	IR	ASM
Identifiers	✔	✔	✔	✔
Variables	✔	✔	✔	✔
Functions	✔	✔	✔	✔
Keywords	✔	✔	✔	✔
Arithmetic Expressions	✔	✔	✔	✔
Assignment	✔	✔	✔	✔
Boolean Expressions	✔	✔	✔	✔
Goto Statements	✔	✔	✔	✔
If/Else Control Flow	✔	✔	✔	✔
Unary Operators	✔	✔	✔	✔
Return Statements	✔	✔	✔	✔
Break Statements	✔	✔	✔	✔
While Loops	✔	✔	✔	✔

Optional Features	Scanner	Parser	IR	ASM
Floats	✔	✔	✔
`++, --, -=, +=, *=, /=`	✔	✔	✔	✔
For Loops	✔	✔	✔
`&, \|, ^` Operators	✔	✔	✔	✔
`<<, >>, ~` Operators	✔	✔	✔	✔
Switch Statements	✔	✔	✔	✔

Extra Features	Scanner	Parser
Pointers
Arrays
Strings	✔	✔
Include Statements	✔	✔
Struct	✔	✔
Enum	✔	✔
Casting
Type Promotion
Type Specs

Code Style

Formatted using Black. Linted using PyLint.

pacocoursey/compiler

C Compiler in Python

Quickstart

Overview

Usage

Character Count

Compiler Flags

-h or --help

-v or --verbose

-f or --force

-g or --grammar

-s or --scan

-p or --parse

-t or --table

-r or --ir

-i or --input

-o or --output

-a or --asm

-n or --asmOutput