/beancd-lang

The beancd programming language

The beancd programming language

       _                              _ 
      | |                            | |
      | |__   ___  __ _ _ __   ___ __| |
      | '_ \ / _ \/ _` | '_ \ / __/ _` |
THE   | |_) |  __/ (_| | | | | (_| (_| |  LANGUAGE
      |_.__/ \___|\__,_|_| |_|\___\__,_|

              
         ---- the beancd language ----

-beancd- (pronounced beancode) is a simple yet complete compiler for the -beancd language-, a java-like object-oriented programming language with an IA-32 asm backend.

This is a learning project to gain practical experience in designing and implementing a compiler (5k loc). I built it for the course Compiler Design taught at ETH Zurich. Beancd covers the main steps in a compiler pipeline including;

  • Lexical analysis
  • Syntax analysis
  • AST intermediate representation
  • Semantical analysis
  • AST optimizations, simple CFG and Data-Flow analysis
  • IA-32 assembly backend

Beancd is writtin in Java and compiles the beancd language to IA-32 assembly.

Language Features

Some of beancds main language feature include;

  • Object oriented
  • Single inheritance
  • Statically typed
  • static scoping: global, class, method scope
  • support for expressions and statements
  • no constructors, no abstract classes
  • supports primitive datatypes boolean, and int
  • Arrays are not co-variant, super of all Arrays is Object
  • built-in methods for read(), write(expr) and writeln()
  • runtime errors for: downcast, array store, out-of-bound, null pointer
  • syscalls are delegated to libc functions

tldr; Hello World aka print 1337

input: beancd

class Main {
    void main() {
        write(1337);
    }
}

output: ia-32

# === WRITELN CONFIG SECTION
.section .data
STR_NL:
  .string "\n"
STR_D:
  .string "%d"

# === VTABLE SECTION
.section .data
___vtable_boolean:
  .long 0xB00B5BAE                               #     0: ___super_vptr

___vtable_Object:
  .long 0xB00B5BAE                               #     0: ___super_vptr

___vtable_Main:
  .long ___vtable_Object                         #     0: ___super_vptr
  .long ___Main_main                             #     4: main

___vtable_int:
  .long 0xB00B5BAE                               #     0: ___super_vptr

.section .text
__rte_array_oob:
  subl $12, %esp
  movl $3, 0(%esp)
  calll exit
  addl $12, %esp

__rte_null:
  subl $12, %esp
  movl $4, 0(%esp)
  calll exit
  addl $12, %esp

__rte_array_size:
  subl $12, %esp
  movl $5, 0(%esp)
  calll exit
  addl $12, %esp

__rte_divby0:
  subl $12, %esp
  movl $7, 0(%esp)
  calll exit
  addl $12, %esp

# === MAIN SECTION
.section .text
.globl main
main:
  pushl  %ebp
  movl %esp, %ebp
  subl $12, %esp
  # Bootstrap baby, bootstrap
  subl $16, %esp
  movl $8, %esi
  subl $12, %esp
  movl %esi, 4(%esp)
  movl $1, 0(%esp)
  call calloc
  movl %eax, %edi
  add $12, %esp
   # ref to vtable setup, ref: $___vtable_Main
  movl $___vtable_Main, 0(%edi)
  # object meta data setup
  movl $0, 4(%edi)
  subl $12, %esp # 12 byte padding before call
  movl %edi, 0(%esp)
  calll ___Main_main
  addl $12, %esp
  addl $16, %esp
  addl $12, %esp
  popl %ebp
  xor %eax, %eax
  retl

# class Main {...}
  # void main(...) {...}
  # === BEGIN Method Main#main
  # callee stack layout for ___Main_main
  #    8(%EBP): TARGET_OBJ      (  4 bytes) n/a
  #    4(%EBP): RETURN_ADDR     (  4 bytes) n/a
  #    0(%EBP): OLD_BP          (  4 bytes) n/a
  #   -4(%EBP): PADDING16       ( 12 bytes) n/a
___Main_main:
  pushl  %ebp
  movl %esp, %ebp
  subl $12, %esp
  movl $1337, %edi
  subl $0, %esp
  sub $12, %esp
  movl %edi, 4(%esp)
  movl $STR_D, 0(%esp)
  call printf
  add $12, %esp
  addl $0, %esp
  # method clean up
  movl %ebp, %esp
  popl %ebp
  ret

Specification

Beancd is a reference implementation of the language specification JavaLi. JavaLi was specified for the course Compiler Design taught by Prof. Thomas Gross at ETH Zurich (2017).

See the specification at TODO

Design Notes

Stack Layout

Upon a function call a stack frame is allocated. Stack frames headers have the following layout;

# +00: RETURN_VAL      
# -04  METHOD_ARG      right most argument first
# -08: METHOD_ARG      another argument
# -12: TARGET_OBJ      reference to this object
# -16: RETURN_ADDR     
# -20: OLD_BP          
# -24: LOCAL_VAR       for each local variable, one such entry exists
# -28: PADDING16       padding for 16 byte alignment

  • Padding entries ensure 16 byte alignment (required to be compatible with IA-32 linux ABI to call gcc functions)
  • The RETURN_VAL entry is only present if the function has a return value.
  • LOCAL_VAR refers to local (stack) variables declared in the scope of a method.

Upon a new stack frame %EBP is pushed on the stack (referred as OLD_BP) and %ESP is stored in %EBP.

Memory Layout

Beancd supports boolean, integer primitive data types, class data types, and array data types of primitive or class data types.

integer
4 bytes
boolean
4 bytes (simplifies implementation)
pointer
4 bytes
classes
8 byte header
arrays
16 byte header

Classes

Classes have a memory footprint of at least 8 bytes.

Class Header

+00: pointer to vtable
+04: meta data

Arrays

Arrays have a memory footprint of at least 16 bytes.

Array Header

+00: pointer to vtable
+04: meta data
+08: number of elements
+12: element size

Virtual Tables

Virtual tables are generated in the assembly prologue for each primitive and class data type. The first 4 bytes of a virtual table contains a reference to the super type. These references are used for inheritance. Primitive data types do not have a super type which is why the magic number 0xB00B5BAE is used to indicate no super type. The class Object does not have a super type either and also uses the same magic number.

All classes (including arrays) are sub-classes of the special class Object. In the vtable dump below, class ___vtable_Main contains a reference to the virtual table of special class Object.

.data
___vtable_boolean:
  .long 0xB00B5BAE                               #     0: ___super_vptr       

___vtable_Object:
  .long 0xB00B5BAE                               #     0: ___super_vptr       

___vtable_Main:
  .long ___vtable_Object                         #     0: ___super_vptr       
  .long ___Main_myFunction                       #     4: myFunction          
  .long ___Main_main                             #     8: main                

___vtable_int:
  .long 0xB00B5BAE                               #     0: ___super_vptr

Runtime Error Detection

The assembly prologue also contains implementations of runtime errors. Runtime error labels are prefixed with __rte_ (as in Run Time Error). Runtime errors abort execution with an error code and call the glibc function exit.

Runtime Assembly

    .text 
__rte_array_oob:
    subl  $12, %esp
    movl  $3, 0(%esp)
    calll _exit
    addl  $12, %esp
__rte_null:
    subl  $12, %esp
    movl  $4, 0(%esp)
    calll _exit
    addl  $12, %esp
__rte_array_size:
    subl  $12, %esp
    movl  $5, 0(%esp)
    calll _exit
    addl  $12, %esp
__rte_divby0:
    subl  $12, %esp
    movl  $7, 0(%esp)
    calll _exit
    addl  $12, %esp

Implemented Errors

Error 01
Invalid downcast
Error 02
Invalid array allocation
Error 03
Array out of bounds
Error 04
Null pointer
Error 05
Invalid array size on allocation
Error 12
Division or Modulo 0.

Optimizations

General Assembly Optimizations

  • For binary expressions, fewer assembly instructions are generated if one or both of the two arguments can be determined at compile time. Specifically, the fact that the instruction cmpl allows one argument to be an immediate value is exploited.
  • A load of a constant value (Int/Boolean) is omitted and immediate values are used instead where possible.
  • Nullchecks are not performed for ThisRef nodes.
  • Fewer instructions for branching between CfgCodeGenerator blocks.
  • If arguments for method calls can be determined at compile time, immediate values are pushed onto stack directlly without moves into registers.

ConstantFoldingOpt.java

  • Uses DataFlowAnalysis for constant propagination.

ArithmeticPropertyOpt.java

  • Exploits arithmetic properties in multiplications or modulo operations. For instance, a multiplication by 0 is always zero. In case the other argument is not a method call (with side effects), the expression is evaluated to 0 at compile time.

BinaryOperationOpt.java

  • Exploits commutativity of multiplication and addition if constants are involved.

StaticConditionOpt.java

  • Removes conditions and branching if condition value can be determined at compile time.

VarOverwriteOpt.java

  • Removes assign statements of variables whose values are not used and reassigned afterwards.

EmptyBlockCleanup.java

  • Removes empty/nop cfg graph nodes.

Further improvements (not enabled/ not implemented)

InlineMethodAnalysis.java

  • Removes method invocation if method returns constant values and replaces invocation by those values. This optimization is currently not enabled as it needs further testing and more advanced implementations for stronger optimizations.

Implement null check analysis

  • Remove unecessary null checks, not implemented

Remove assignments in Loops whose value does not dependent on iteration

  • Not implemented

More Examples

Source Language

class Main {
  void main() {
    int res;
    res = myFunction(1, 2);
    write(res);
    writeln();
  }
  int myFunction(int arg1, int arg2) {
    int local;
    local = 1337;
    return local + arg1 + arg2;
  }
}

Assembly