Druid: Meta-interpretation for Just-In-Time compiler generation
The Druid project explores the automatic generation of machine code templates from bytecode interpreters using an abstract interpreter on the existing bytecode interpreter (a meta-interpreter). This approach could benefit from having a single runtime implementation and having a JIT compiler generated from it.
Context
JIT (Just-in-Time) compilers are an optimization technique often used for interpreted languages and virtual machines. They allow to spend time optimizing only frequently used code, while falling back in slower execution engines for non-frequent code. For example, the Pharo and the Java VM run on a bytecode interpreter and eventually compile machine code for methods that are frequently called.
Nowadays, the Pharo Virtual Machine is implemented in a subset of the Pharo language called Slang. The Virtual Machine developers then benefit from the high-level tools used to work with Pharo code, such as the code editors, testing frameworks and debuggers. In a later stage, the Virtual Machine code written in Slang is transpiled to C and then compiled to the target architectures.
The current Pharo JIT compiler that is part of the Virtual Machine, aka Cogit, implements an architecture based on templates of native code per bytecode. When a method is compiled, each bytecode is mapped to its corresponding template. All templates are concatenated to form a single machine code method. This architecture has as drawback that the behavior of the Pharo language is duplicated in both the bytecode interpreter and their corresponding machine code templates.
The Druid project explores the automatic generation of machine code templates from bytecode interpreters using an abstract interpreter on the existing bytecode interpreter (a meta-interpreter).
Getting started
Installing Dependencies
- You should also install the libraries unicorn and llvm.
- If you are on linux, use your preferred package manager.
- If you're on OSX, we recommend using homebrew.
- In windows, we should check.
Install Unicorn library for Pharo
# Unicorn 2 Setup for PharoVM
# Instructions at: https://github.com/pharo-project/unicorn/blob/pharo-vm-unicorn2/docs/COMPILE.md
git clone https://github.com/pharo-project/unicorn.git
cd unicorn
git checkout pharo-vm-unicorn2
mkdir build; cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
Now replace possible already installed unicorn with the new compiled one, for example:
locate libunicorn.2.dylib
mv /opt/homebrew/Cellar/unicorn/2.0.1/lib/libunicorn.2.dylib /opt/homebrew/Cellar/unicorn/2.0.1/lib/libunicorn.2.old
cp libunicorn.2.dylib /opt/homebrew/Cellar/unicorn/2.0.1/lib
Open the Pharo image with the Unicorn project loaded and check:
UnicornLibrary uniqueInstance macModuleName
Installing Druid
Druid works for now on Pharo 11:
EpMonitor disableDuring: [
Metacello new
repository: 'github://Alamvic/druid:main';
baseline: 'Druid';
load ]
This will start the installation process, clone the dependencies, and load all necessary Pharo code.
(Prelude) An overview of the Pharo VM
The Pharo VM is a program that executes Pharo programs. Its main components are execution engine, and a memory manager. The execution engine is made of an interpreter (a bytecode interpreter), and a JIT compiler that compiles to machine code methods that are used often. The memory manager implements how objects are stored in memory, and an automatic reclamation of memory (usually called garbage collector).
The Pharo VM is implemented in a subset Pharo itself called Slang. Thus, we can use Pharo tools to execute, test and debug the VM in a "simulated" manner. To produce a productive VM the Slang Pharo code is transpiled to C and then compiled in the current architecture.
Druid works on the Slang part of the VM, and its scope does not touch (for now) the C-code generation.
(Prelude) Bytecodes and a stack machine
Pharo methods are written as bytecodes and primitives.
For example, the following method is compiled as a method made of platform independent bytecode and an array of literals (also called literal frame) used in that method.
The bytecodes are virtual machine code instructions, the literals are the objects that represent fixed values used in that method.
For example, 1 and 17 are literals in this method. Besides numbers, other kind of literals are 'strings'
, literal arrays such as #(a b c 1)
and characters $A
.
MyClass >> foo
^ 1 + 17
We can inspect the method above MyClass >> foo
and see its bytecodes and literals.
In the list below the first number is the bytecode index, the second is the instruction bytes, and then there is a mnemonic of the instruction and arguments.
25 <76> pushConstant: 1
26 <20> pushConstant: 17
27 <B0> send: +
28 <7C> returnTop
It is important to understant that the Pharo bytecode is based on a stack machine (in contrast with register machines).
This means that most of the data between operations is exchanged through a stack.
In the example above, the first bytecode pushes a 1 into the stack and the second bytecode pushes a 2.
Then, the third bytecode has to send a message +
, so it pops two elements from the stack: one to use as the receiver, and one to use as an argument.
When the execution of the message send finishes, the result is pushed to the stack.
Finally, the last bytecode takes the top of the stack (the result of the addition), and returns it to the caller.
(Prelude) Understanding the bytecode interpreter
When a bytecode method is executed by the interpreter, it iterates all bytecodes of a method and executes a VM routine for each of them.
The class implementing the bytecode interpreter is StackInterpreter
.
For example, the pushConstantOneBytecode
is the routine that pushes a 1 to the stack calling the internalPush:
method.
Since pushing the value 1 is a very common operation, a special bytecode is used for it to avoid putting the 1 in the literal frame.
StackInterpreter >> pushConstantOneBytecode
self fetchNextBytecode.
self internalPush: ConstOne.
The method pushLiteralConstantBytecode
pushes a generic literal value to the stack also using internalPush:
.
The value pushed is taken from the literal frame of the method, and the index is calculated from manipulating the currentBytecode
variable.
Bytecode 33 pushes the first literal in the frame (33 bitAnd: 16r1F => 1), bytecode 34 pushes the second literal, and so on...
StackInterpreter >> pushLiteralConstantBytecode
<expandCases>
self
cCode: "this bytecode will be expanded so that refs to currentBytecode below will be constant"
[self fetchNextBytecode.
self pushLiteralConstant: (currentBytecode bitAnd: 16r1F)]
inSmalltalk: "Interpreter version has fetchNextBytecode out of order"
[self pushLiteralConstant: (currentBytecode bitAnd: 16r1F).
self fetchNextBytecode]
Finally, bytecodes such as the message send +
are implemented as follows.
First this bytecode gets the top 2 values from the stack.
Then it checks if boths are integers, and if the result is an integer, in which case it pushes the value and finishes.
If they are not integers, it tries to add them as floats.
If that fails, it will perform a (slow) message send using the normalSend
method.
StackInterpreter >> bytecodePrimAdd
| rcvr arg result |
rcvr := self internalStackValue: 1.
arg := self internalStackValue: 0.
(objectMemory areIntegers: rcvr and: arg)
ifTrue: [result := (objectMemory integerValueOf: rcvr) + (objectMemory integerValueOf: arg).
(objectMemory isIntegerValue: result) ifTrue:
[self internalPop: 2 thenPush: (objectMemory integerObjectOf: result).
^ self fetchNextBytecode "success"]]
ifFalse: [self initPrimCall.
self externalizeIPandSP.
self primitiveFloatAdd: rcvr toArg: arg.
self internalizeIPandSP.
self successful ifTrue: [^ self fetchNextBytecode "success"]].
messageSelector := self specialSelector: 0.
argumentCount := 1.
self normalSend
(Prelude) Understanding the existing Cogit JIT compiler
When a bytecode method is executed a couple of times, the Pharo virtual machine decides to compile it to machine code. Compiling the method to machine code avoids performance overhead due to instruction fetching, and allows one to perform several optimizations. The compilation of a machine code method goes pretty similar to the interpretation of a method. The JIT compiler iterates the bytecode method and for each of the bytecodes it executes a code generation routine. This means that we will (almost) have a counterpart for each of the VM methods implementing bytecode interpretation.
For example, the machine code generator implemented for StackInterpreter>>pushLiteralConstantBytecode
is Cogit>>genPushLiteralConstantBytecode
.
Cogit >> genPushLiteralConstantBytecode
^self genPushLiteralIndex: (byte0 bitAnd: 31)
StackToRegisterMappingCogit >> genPushLiteralIndex: literalIndex "<SmallInteger>"
"Override to avoid the BytecodeSetHasDirectedSuperSend check, which is unnecessary
here given the simulation stack."
<inline: false>
| literal |
literal := self getLiteral: literalIndex.
^self genPushLiteral: literal
The JIT'ted version of the addition bytecode (genSpecialSelectorArithmetic
) is slightly more complicated, but it pretty much matches what it is done in the bytecode.
Overview of Druid
In Druid, a meta-interpreter analyzes the bytecode interpreter code and generates an intermediate representation from it. A compiler interface then generates machine code from the intermediate representation. The output of the intermediate representation should have in general terms the same behaviour as the existing Cogit JIT compiler.
To verify the correctness of the compiler we use:
- a machine code simulator (Unicorn)
- a disassembler (llvm)
(please see the links to these projects in the references)
The (meta-)interpreter
The setup is the following: we have one Pharo AST interpreter that we call the meta-interpreter that executes the code code of the
StackInterpreter
and generates the corresponding intermediate representation of StackInterpreter
methods.
Check Fun with interpreters
from the references to see more details on what an ASTs and abstract interpreters are.
In the code below, the meta-interpreter is called DRASTInterpreter
(for the DruidAST interpreter) and it will analyse the methods
of the stack interpreter returned byt the expression Druid new newBytecodeInterpreter
.
For this task, the meta-interpreter uses an IR builder that is responsible for encapsulating the logic of IR building.
builder := DRIRBuilder new.
builder isa: #X64.
astInterpreter := DRASTInterpreter new.
astInterpreter vmInterpreter: Druid new newBytecodeInterpreter.
astInterpreter irBuilder: builder.
This AST interpreter then receives as input a list of bytecodes to analyze, it maps each bytecode to the routine to execute, obtains the AST and interprets each of the instructions of the AST using a visitor pattern.
astInterpreter interpretBytecode: #[76].
Visiting the AST
The DRASTInterpreter
class implements a visiting
protocol where the visit
methods are grouped.
Most of the visit methods are simple, like the following ones:
DRASTInterpreter >> visitSelfNode: aRBSelfNode
^ self receiver
DRASTInterpreter >> visitTemporaryNode: aRBTemporaryNode
^ currentContext temporaryNamed: aRBTemporaryNode name
Context reification
To properly analyze the scope of temporary variables, the AST interpret reifies the contexts/stack frames. On each method or block activation, a new context is pushed to the stack. The temporary variables are then read and written from the current context. When a method or block returns, the current context is popped from the stack, to return to the caller context.
Interpreting Message sends
The most important operation in a Pharo program are message sends.
Message sends are used not only for normal method invocations, but also for common operators such as additions (+
) and multiplications (*
) and control flow such as conditionals (ifTrue:
) and loops (whileTrue:
).
However, some of these special cases require special treatments when generating code.
For example, a multiplication should directly generate a normal multiplication and not require interpreting how that multiplication is implemented.
To manage such special cases, we keep a table in the interpreter that maps (special selector -> special interpretation). When we find a message send in the AST, we lookup the selector in the table. If we find a special case in the entry, we invoke that special entry in the interpreter. Otherwise, we lookup the method and activate the new method, which will recursively continue the interpretation
DRASTInterpreter >> visitMessageNode: aRBMessageNode
| arguments astToInterpret receiver |
"First interpret the arguments to generate instructions for them.
If this is a special selector, treat it specially with those arguments.
Otherwise, lookup and interpret the called method propagating the arguments"
receiver := aRBMessageNode receiver acceptVisitor: self.
arguments := aRBMessageNode arguments collect: [ :e | e acceptVisitor: self ].
specialSelectorTable
at: aRBMessageNode selector
ifPresent: [ :selfSelectorToInterpret |
^ self perform: selfSelectorToInterpret with: aRBMessageNode with: receiver with: arguments ].
astToInterpret := self
lookupSelector: aRBMessageNode selector
receiver: receiver
isSuper: aRBMessageNode receiver isSuper.
^ self interpretAST: astToInterpret withReceiver: receiver withArguments: arguments.
For example, the following code snippet shows how the internalPush:
is mapped as just a normal push instruction in the intermediate representation.
DRASTInterpreter >> initialize
super initialize.
irBuilder := DRIRBuilder new.
specialSelectorTable := Dictionary new.
...
specialSelectorTable at: #internalPush: put: #interpretInternalPushOn:receiver:arguments:.
DRASTInterpreter >> interpretInternalPushOn: aRBMessageNode receiver: aStackInterpreterSimulatorLSB arguments: aCollection
^ irBuilder push: aCollection first
The intermediate representation
The intermediate representation is generated by a builder object, instance of DRIRBuilder
.
The AST interpreter collaborates with it to create instructions, new basic blocks and so on.
The intermediate representation that is created is somewhat inspired on a low-level intermediate representation as described in [Linear Scan Register Allocation for the Java HotSpot™ Client Compiler].
Some meta-interpretation of special cases
Not all special cases generate instructions or interact with the DRIRBuilder.
Some of them actually modify the state of the AST interpreter.
For example, a special case is the fetchNextBytecode
instruction, that makes the interpreter move to the next byte in the list of bytecodes.
To simulate the same behaviour in our interpreter, its special case is implemented as follows:
DRASTInterpreter >> interpretFetchNextBytecodeOn: aMessageSendNode receiver: aReceiver arguments: arguments
self fetchNextInstruction
DRASTInterpreter >> fetchNextInstruction
currentBytecode := instructionStream next
The compiler interface
Once the interpreter finishes its job, the irBuilder will contain the intermediate representation instructions.
We can then generate machine code from them using the DRIntermediateRepresentationToMachineCodeTranslator
class.
astInterpreter irBuilder assignPhysicalRegisters.
mcTranslator := DRIntermediateRepresentationToMachineCodeTranslator
translate: builder instructions
withCompiler: cogit.
address := cogit methodZone freeStart.
endAddress := mcTranslator generate.
DRIntermediateRepresentationToMachineCodeTranslator
iterates all the instructions and uses the double-dispatch pattern to
generate code for each of them.
It uses a backend to generate the actual machine code.
aCollection do: [ :anIRInstruction |
anIRInstruction accept: self ].
DRIntermediateRepresentationToMachineCodeTranslator >> visitAdd:
DRIntermediateRepresentationToMachineCodeTranslator >> visitLoad:
DRIntermediateRepresentationToMachineCodeTranslator >> visitPush:
About the Tests
The whole process has the following steps:
- The bytecode interpreter is interpreted to generate the instructions in the IRBuilder
- Then the physical registers will be allocated to the Druid instructions
- The DRInstructions generates the AbstractInstructions (Cogit) that basically are one to one representations with the machine code
- Compilation to machine code byte array
So, we have different tests to test different stages in the transformation:
- The class
DRIRBuilderTest
tests the construction of the ir. - The class
DRASTInterpreterTest
tests the effect of interpreting ASTs - The DRIntermediateRepresentationToMachineCodeTranslatorTest simulates that each DR instrucion is correctly translated to the corresponding machine code
- The class
DRSimulateGeneratedBytecodeTest
tests from end-to-end the generation of code of a bytecode and its execution in a machine code simulator
Other test classes test a basic register allocation algorithm (DRRegisterAllocationTest
), the generation of machine code from an IR without passing through the AST (DRIntermediateRepresentationToMachineCodeTranslatorTest
) and the execution of machine code from an IR without passing through the AST (DRSimulateGeneratedCodeTest
).
Little exercises
- In tbe book https://github.com/SquareBracketAssociates/PatternsOfDesign/releases
- Chapter 4: Die and DieHandle double Dispatch (if you want to make sure that Double Dispatch has been understood do the Stone Paper Scissor Chapter)
- Chapter 3 A little expression interpreter
- Chapter 6 Understanding visitor
- After reading https://github.com/SquareBracketAssociates/Booklet-FunWithInterpreters
References
-
Linear Scan Register Allocation for the Java HotSpot™ Client Compiler http://www.ssw.uni-linz.ac.at/Research/Papers/Wimmer04Master/
-
Practical partial evaluation for high-performance dynamic language runtimes https://dl.acm.org/doi/10.1145/3062341.3062381
-
Structure and Interpretation of Computer Programs http://web.mit.edu/alexmv/6.037/sicp.pdf
-
Fun with Interpreters https://github.com/SquareBracketAssociates/Booklet-FunWithInterpreters/releases/download/continuous/fun-with-interpreters-wip.pdf
-
Paper explaining the motivations behind Sista Bytecode https://github.com/SquareBracketAssociates/Booklet-PharoVirtualMachine/raw/master/bib/iwst2014_A%20bytecode%20set%20for%20adaptive%20optimizations.pdf