
A C compiler written in C#


C Compiler in C#

This project aims to be a full ANSI C compiler. It generates 32-bit x86 assembly code for Linux. The goal is to produce .s files that gcc's assembler and linker can use directly.

Building this compiler

  1. Download and install .NET 6.0 or above.
  2. Run dotnet build [-c %CONFIGURATION%], where %CONFIGURATION% is the build configuration - Debug or Release (Debug by default).

How to use it

Executable files are located in the .\bin\%CONFIGURATION%\net6.0 directory. Simply pass a source file name as the only argument: dotnet C-Compiler.dll %source.c% or C-Compiler %source.c%

The compiler will output assembly code to the terminal.

Example

Inside \TestPrograms you can find LinkedList.c:

int printf(char *, int);
void *malloc(unsigned int nbytes);

typedef struct Node {
	int value;
	struct Node *next;
} Node;

Node *cons(int value, Node *tl) {
	Node *hd = malloc(sizeof(struct Node));
	hd->value = value;
	hd->next = tl;
	return hd;
}

void print_list(Node *hd) {
	Node *cur = hd;
	for (; cur != (void *)0; cur = cur->next) {
		printf("%d\t", cur->value);
	}
}

void print_list_recursive(Node *hd) {
	if (hd != (void *)0) {
		printf("%d\t", hd->value);
		print_list_recursive(hd->next);
	}
}

void *flip_list(Node *hd) {
	Node *cur = hd;
	hd = 0;
	while (cur != 0) {
		Node *next = cur->next;
		cur->next = hd;
		hd = cur;
		cur = next;
	}
	return hd;
}

int main() {
	Node *hd = cons(2, cons(1, cons(0, (void *)0)));
	print_list(hd);
	hd = flip_list(hd);
	print_list(hd);
	print_list_recursive(hd);
	return 0;
}

Run the compiler with the file name as the only command-line argument. It will print the assembly code. Save that output as LinkedList.s, then run gcc LinkedList.s on a 32-bit Linux system to produce an executable.

You can see that this program uses quite a lot of features: pointers, structs (especially structs whose members refer to the struct type itself), loops, functions, and so on. This compiler supports the whole of ISO C. However, it doesn't contain a preprocessor, so you can't directly use NULL (which is #defined as (void *)0) or #include any header file.

How this compiler works

1. Handwritten Scanner (lexical analysis)

  • Not automatically generated by flex.
  • Written using the standard state-machine approach.

The first phase of a compiler is lexical analysis. In this phase, the source code is loaded and scanned through to produce a list of minimal grammar units: tokens. Tokens are to a programming language what words are to English. Each token in the C language falls into one of these categories: char, float, identifier, int, keyword, operator, and string.

You might see different classification schemes, such as ones with a separate punctuator category; I fold punctuators into the operator type.

Let's look at a concrete example. Suppose we have a line of code that looks like this.

int a = ceil(3.2);

After lexical analysis, the code will be transformed into these tokens:

Source   Token kind   Data
int      Keyword      int
a        Identifier   "a"
=        Operator     assignment
ceil     Identifier   "ceil"
(        Operator     open parenthesis
3.2      Float        3.2, no suffix, so it is a double
)        Operator     close parenthesis
;        Operator     semicolon

Note that each token has a type, plus some type-specific data: for a keyword, you need to know exactly which keyword it is; for a floating-point number, you need to know what number it is and whether it's a float or a double; and so on.

Each kind of token has its unique pattern, making it recognizable by the program. For example, an identifier always starts with an underscore '_' or a letter, so we won't confuse it with an integer.

The scanner is implemented by hand-designing a finite state automaton for each kind of token. I've drawn a graph for each automaton in /Scanner/FSAGraphs. Each automaton starts in a START state and performs one state transition on reading each character. Some states are accepting states; reaching one indicates a match. If no valid transition can be performed on an input, the automaton fails to match.

All automata start at the same time. A token is emitted when exactly one automaton is in an accepting state and all the other automata have failed.

2. Handwritten Parser (grammar analysis)

  • Not automatically generated by yacc / bison.
  • Standard recursive descent parser, with a little hack.

The next phase is parsing. In this phase, the tokens are organized into a tree structure. For example, a = 3 * 2; is a valid C statement, but a = 2 3 3 3 3; is not. Note that a syntactically correct C program (one that passes the parser) is not necessarily valid. For example, float f = "some string"; is syntactically correct, but it has a type error.

This means that the grammar itself doesn't expose all features of the language. The code must be examined more carefully to extract the meaning. That is done in the next phase: semantic analysis.

In writing the parser, I used several references; the names of the non-terminals mainly follow MSDN.

The C syntax is actually ambiguous. T *a; could mean either a declaration of a pointer, if T is a typedef name, or a multiplication, if T is a variable. To solve this problem, I introduced a tiny environment just for the parser, which keeps track of the typedef names in the current scope. This environment is thrown away immediately after parsing, since all types and other things are checked more carefully in the next phase.

The code for parsing is in /Parser, and it generates a syntax tree defined in /SyntaxTree.

3. Semantic Analysis

  • A type system to perform C's implicit type casts and other tasks.
  • An environment system to record user-defined symbols.

In this phase, type information is added, and all implicit typecasts are performed. Furthermore, the meanings of all symbols are made clear.

In the beginning, I thought I would just modify the tree generated by the parser. However, there are simply too many things to do and the tree would be in a confusing state. So instead, I decided that in this phase, a brand new tree would be generated out of the previous tree.

The code for this phase is in /SyntaxTree, and it generates a new tree called the abstract syntax tree, defined in AST.

Implicit type conversions

Implicit conversions are always possible between any two arithmetic types. The full rules are described here:

https://www.safaribooksonline.com/library/view/c-in-a/0596006977/ch04.html

Environment

To determine the semantics of a piece of code, we must examine it in context.

Take the following two pieces of code as an example.

Code snippet 1:

int a = 3;
int b = 4;
a * b;

Code snippet 2:

typedef int a;
a *b;

Both contain a line a * b;, but they mean very different things. The first is a multiplication expression (though performing the multiplication doesn't seem to do any good...), while the second is a declaration of a pointer variable b.

It turns out that the environment only has an impact on identifiers: we can't conclude what a given identifier refers to until we're given an environment. So we need to maintain a data structure that records all user-defined identifiers.

So, where could a new identifier come in?

  1. Declaration statements.

    Obviously, the purpose of a declaration is to introduce a new identifier.

The conversion between arrays and pointers

Consider the following declaration:

int arr[3][2];

This would create an array of arrays, and the memory layout would be:

arr[0][0] arr[0][1] arr[1][0] arr[1][1] arr[2][0] arr[2][1]

If the user writes the following expression

arr[2][1]

then the corresponding element would be visited.

The expression above would be implicitly translated to

*(*(arr + 2) + 1)

It is worth our time to figure out what each step means and what each intermediate result's type is.

  1. arr + 2

    arr is of type int [3][2]. Its value is the base address of the array.

    The standard says you can add an integer to a pointer, but it doesn't say you can add an integer to an array. This means that the array needs to be implicitly cast to a pointer int (*)[2].

    Note: int (*)[2] is not the same as int *[2]. int (*)[2] means "a pointer to int [2]", while int *[2] means "an array of 2 int *'s". This is because the operator [] binds more tightly than the operator *.

    A pointer and an array are very different. Applying the sizeof operator to a pointer yields just 4 on this 32-bit target, since a pointer has the same size as an unsigned int; applying it to an array yields the array's total size.

    • sizeof(int (*)[2]) is 4;

    • sizeof(int [3][2]) is 3 * 2 * 4 = 24.

    Now that we've done the implicit type cast, we can perform the actual addition. This is a "pointer + integer" addition, which is scaled. Since our pointer points to an int [2], the addition is scaled by sizeof(int [2]), which is 8.

    The result is (base + 2 * 8), and has the same pointer type: int (*)[2].

    arr[0][0] arr[0][1] arr[1][0] arr[1][1] arr[2][0] arr[2][1]
                                            |
                                            result is this address
    
  2. *(arr + 2)

    The next step is a dereference. The type of the result is int [2], but the value is rather special: it is still (base + 2 * 8), since the value of an array is just its base address.

  3. *(arr + 2) + 1

    This step is pretty much the same. The array is cast into a pointer; this time the pointer has type int *, so the scale is 4. The result of this expression is (base + 2 * 8 + 1 * 4) and has type int *.

    arr[0][0] arr[0][1] arr[1][0] arr[1][1] arr[2][0] arr[2][1]
                                                      |
                                                      result is this address
    
  4. *(*(arr + 2) + 1)

    The final step is a dereference: just load the integer at that address. A fairly easy step.

Issues in semantic analysis

  • Certain expressions can change the environment.

    First, consider the code below:

    some_int = foo(0);
    some_struct = foo(1);

    foo isn't declared before it is used, so the compiler doesn't know its prototype, making the first use of foo impossible to check. The compiler immediately makes up a prototype int foo(int) based on the first use. The second use then causes a type error.

    In this case, simply using an (undeclared) function introduces a new prototype, thus changing the environment.

    Then, consider this:

    int size = sizeof(struct A { int a; });
    // a new type introduced inside a 'sizeof' expression.
    
    void *ptr = (struct B { int b; } *)0x0;
    // a new type introduced inside a type-cast expression.
    
    struct A a = { 3 };
    struct B b = { 4 };
    // these new types can also be referred to afterwards, but only in the same scope.

    So, we can see that an expression can change the environment.

  • Certain statements like return or case require special environments.

    The current implementation doesn't check this during semantic analysis; invalid code will instead fail in the code generation phase.

  • Conversions between pointers, functions, and arrays are not yet fully worked out.

    I need to read the relevant parts of the standard and implement them properly.

  • Initialization lists are not implemented.

4. Code Generator

  • Generates x86 assembly code.

This is the final phase. x86 assembly code is generated, which can be fed directly to gcc's assembler to produce a Linux executable.