/dj2ll-public

the LLVM API portion of the dj2ll compiler for diminished java

Primary LanguageC++GNU General Public License v3.0GPL-3.0

dj2ll

This repository contains the header files for a compiler front-end and the implementation files for a compiler back-end built atop the LLVM C++ API. The front-end implementation files are not public due to their being part of a graded project for the spring 2020 compilers elective at the University of South Florida; the professor requests that we keep those assignments private.

The source language for this compiler is Diminished Java (DJ), a language devised by Dr. Ligatti specifically for his compilers course. His language documentation contains more information about the features and capabilities of DJ.

The development of the LLVM-related code herein took place over the course of ten weeks, as part of an independent study assignment with supervision from Dr. Ligatti. The compiler translates the original AST (written in C) into one in C++, which is more amenable to the C++ API provided by LLVM. The compiler then walks the translated AST to generate an object file; dj2ll calls clang to deal with linking.

The files provided here include:

  1. Header and implementation files for all LLVM-related work (codeGen* files),
  2. Header files of my own design (the implementation of these deals directly with the original AST, so they remain private),
  3. Header files provided by Dr. Ligatti to guide the design of the original compiler,
  4. Test programs provided by Dr. Ligatti, and
  5. A Python script to run the compiler against the test programs.

Running the Compiler

The releases tab provides an executable version of dj2ll. I built the executable using Version 10.0.1 of clang and clang++ on Arch Linux. The compiler should work on any Linux system that has access to clang and the LLVM libraries. dj2ll knows about four flags:

  1. --skip-codegen: lex, parse, and typecheck, but skip code generation.
  2. --run-optis: create an optimized executable.
  3. --emit-llvm: output to the console the LLVM IR produced by the source file.
  4. --verbose: output the translated AST.

If you run dj2ll on some file test.dj, it will produce an executable with name test in the same directory from which you called the compiler.

Design choices and areas of note

New Global Values

The llvm_includes.hpp file contains significant global variables used either by LLVM methods or by the code generation methods. The NamedValues map maps a string (representing methods and “main”) to the appropriate symbol table that method should use. This is a limitation of my approach! If a user declares a class named (exactly) C1_method_bar and a class C1 exists that implements a method bar, a colision will occur. Because the global number of DJ users is low, I felt this was an acceptable risk. (I get the feeling this is part of the reason why clang and gcc mangle function names…)

LLAST

The code in llast.cpp and its header file define the interface for the new AST. I opted to design the new AST as a series of DJNodes, of which DJProgram and DJExpression are the main child classes. All other DJ expression types are child classes that inherit from DJExpression. I chose early on to not translate the original symbol tables, which turned out to be a small mistake; it would have made much of code generation more simple to have translated these data structures into C++ as well.

TranslateAST

The code in translateAST.cpp takes the validated AST produced by the typechecker and translates it into the LLAST.

Code Generation

The files codegen.cpp and codeGenClass.cpp contain DJExpression and DJProgram methods and related helper functions for code generation.

Classes and their methods

I implemented classes as structs, with their methods implemented as functions that take a pointer to a particular structure. I laid out structs in memory like this (for some example class C):

  1. pointer to C (the this pointer)
  2. class ID (this is the same as the class’ number in the classesST symbol table)
  3. any fields

This comes into play when accessing struct fields using LLVM’s notorious get element pointer instruction. To get the value of the first field declared in a struct, we would create a GEP using index (0,2), with the offset being for the this pointer and the class ID as layed out above.

VTables

Due to the restrictions of the LLVM IR type system, I implemented virtual dispatch tables as nine separate functions, one for each of the possible pairs of return type and parameter type that can occur in DJ. When a return type or parameter type is a class, I casted the value to one of Object type. I confirmed that this would have no effect on accessing class fields later; a cast back to the original type is all that it took. When inside a VTable function, a chain of if/then/else statements accomplishes the dispatch. Even for the two DJ class declarations below, the VTable becomes verbose:

class C1 extends Object {
  nat whoami(nat unused) { printNat(1); }
}

class C2 extends C1 {
  nat whoami(nat unused) { printNat(2); }
}
define i32 @natVTablenat(%Object* %0, i32 %1, i32 %2, i32 %3) {
entry:
  %4 = bitcast %Object* %0 to %C1*
  %5 = getelementptr %C1, %C1* %4, i32 0, i32 1
  %6 = load i32, i32* %5
  %7 = icmp eq i32 %1, 1
  %8 = icmp eq i32 %6, 1
  %9 = and i1 %7, %8
  %10 = icmp eq i32 %2, 0
  %11 = and i1 %9, %10
  br i1 %11, label %then, label %else

then:                                             ; preds = %entry
  %12 = call i32 @C1_method_whoami(%C1* %4, i32 %3)
  ret i32 %12

else:                                             ; preds = %entry
  %13 = bitcast %C1* %4 to %C2*
  %14 = getelementptr %C2, %C2* %13, i32 0, i32 1
  %15 = load i32, i32* %14
  %16 = icmp eq i32 %1, 1
  %17 = icmp eq i32 %15, 2
  %18 = and i1 %16, %17
  %19 = icmp eq i32 %2, 0
  %20 = and i1 %18, %19
  br i1 %20, label %then1, label %else2

then1:                                            ; preds = %else
  %21 = call i32 @C2_method_whoami(%C2* %13, i32 %3)
  ret i32 %21

else2:                                            ; preds = %else
  %22 = getelementptr %C2, %C2* %13, i32 0, i32 1
  %23 = load i32, i32* %22
  %24 = icmp eq i32 %1, 2
  %25 = icmp eq i32 %23, 2
  %26 = and i1 %24, %25
  %27 = icmp eq i32 %2, 0
  %28 = and i1 %26, %27
  br i1 %28, label %then3, label %else4

then3:                                            ; preds = %else2
  %29 = call i32 @C2_method_whoami(%C2* %13, i32 %3)
  ret i32 %29

else4:                                            ; preds = %else2
  ret i32 0

ITables

I implemented instanceof tables in a manner similar to virtual dispatch tables; namely, a mess of chained if/then/else statements. The type system was not a concern here, and so there is only one ITable function in the resulting IR.