nominolo/lambdachine

Hacking/helping

Opened this issue · 23 comments

I am creating a ticket since there doesn't seem to be any other way to discuss this project (what about a Google Group?).

I find this an extremely interesting project for the Haskell ecosystem and would like to help (to be honest, I will not be able to contribute to the inner workings of the tracing JIT at first, but I am sure there are other things I can help with).

Is there a high-level overview of the architecture? I am looking for practical information such as: what are the GHC / GHC API touch points (present and planned), what are the existing patches on top of GHC, and what are the artifacts of the build process?

In addition, any pointers on how you go about debugging the code would also be helpful.

UPDATE: Sorry, just saw instructions on how to build.

Also interested.

Thanks guys. I'm pretty sure the instructions are not up to date anymore. I have another branch for GHC 7.6 which I'll merge tomorrow, and then I'll take another look at the instructions. The biggest issue ATM is that you need a custom-built GHC 7.0 (for current master) or GHC 7.6, compiled with the integer-simple library instead of the default integer-gmp. On recent OS X, GHC 7.0 cannot be built from source anymore (issues caused by gcc actually being Clang), but the 7.6 port is missing some libraries. The runtime (i.e., the part that contains the JIT) is independent of GHC, but you can't produce the bytecode files without GHC.

I just tried to build the ghc-7.6 branch and got the following error after following the build instructions:

vm/memorymanager.hh:10:10: fatal error: 'tr1/unordered_set' file not found
#include <tr1/unordered_set>

So the JIT is pure C/C++? Out of curiosity: is that better than Haskell for this, or are there other benefits/constraints (e.g. maybe most of this comes from the LuaJIT implementation)?

I also saw some movement in the sources from GHC Core to STG. I am curious about the motivation.

Also, how does the compilation work? It generates bytecode which would be the equivalent of the .o/.a files that are currently generated, correct? What about packaging and linking: do you have specific thoughts there? Would the bytecode be dynamically linked too?
(Sorry, many questions at this point; I have had little time to really get to know the code base.)

I'm not using many C++ features; I mostly felt that C++ would make certain abstractions nicer. I'm using C++'s unordered sets simply because I didn't want to spend time implementing my own data structures. It looks like the ./configure step doesn't figure out the right header for your system. Try changing tr1/unordered_set to just unordered_set. You'll probably have to adjust a few #defines, too.
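For example, a shim along these lines should do it (the macro name here is hypothetical, not the project's actual configure output):

    // Hypothetical compatibility shim for the tr1/unordered_set issue.
    // LC_HAVE_TR1_UNORDERED_SET is a made-up name, not the real #define.
    #if defined(LC_HAVE_TR1_UNORDERED_SET)
    # include <tr1/unordered_set>
    namespace lc { using std::tr1::unordered_set; }
    #else
    # include <unordered_set>   // libc++ / C++11 toolchains (e.g. recent Clang)
    namespace lc { using std::unordered_set; }
    #endif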

A lot of the code generation stuff is copied or adapted from LuaJIT, but I chose some more verbose names for readability. All that could use some clean-up and documentation.

The switch from Core to STG was unfortunately necessary. In GHC 7.0, Core didn't allow unboxed tuples as function arguments, only as results. Max Bolingbroke lifted that restriction, but he added a magic pass that turns functions taking unboxed tuple arguments into functions taking multiple arguments. That pass is untyped and therefore only works on STG. That said, I think the STG-to-bytecode translation is actually simpler. Before, I was using CorePrep, and STG is really just CorePrep without types. I used CorePrep because I thought I might need the type information, but dealing correctly with types turned out to be very messy, so I think STG is the right level.

The bytecode compiler generates .lcbc files where GHC would generate .o files. The bytecode loader automatically searches for these files (based on the module name) in the directories given by the -B option. There are no packages yet, and the loader is very simple. Adding a proper bytecode packaging format (à la .jar files) wouldn't be too difficult, but it would be a fair amount of work and it's not a priority at this stage.
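For illustration, the lookup described above amounts to something like this sketch (the function name and signature are made up, not the actual loader API):

    // Hypothetical sketch: map a module name to an .lcbc file by probing
    // each -B directory in order. Not the real loader code.
    #include <fstream>
    #include <string>
    #include <vector>

    std::string findModuleFile(const std::string &module,
                               const std::vector<std::string> &basePaths) {
      std::string rel = module;                  // "Data.List"
      for (char &c : rel) if (c == '.') c = '/'; // -> "Data/List"
      rel += ".lcbc";                            // -> "Data/List.lcbc"
      for (const std::string &dir : basePaths)
        if (std::ifstream(dir + "/" + rel).good())
          return dir + "/" + rel;
      return "";  // not found
    }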

Thanks for the pointers! I will try to get past this compilation issue (I am on Mavericks with Xcode/clang 5.0 and gcc 4.2 installed via brew; it was able to compile GHC 7.6 from source) and see how far I get.

I didn't have time to follow up on this so far (I hope to soon).

In any case, I was wondering if you saw this: https://github.com/franko/luajit-lang-toolkit

Could your project benefit from it somehow?

Yes, I saw that. But it didn't exist (or at least I didn't know about it) when I started Lambdachine (2010, I think), and there's also the issue that LuaJIT is optimised for Lua's semantics (Mike Pall has reiterated that a few times in a number of places on the web). Since Haskell's semantics and runtime behaviour are quite different, this wouldn't work too well once you want to go past the prototype stage.

I would be extremely interested in hacking on the interpreter. I've implemented my own high-performance interpreter using the threaded-code method (like LuaJIT), but looking at lambdachine in its current state, there's quite a bit of expensive code in the instruction dispatch. Maybe you'd welcome some help slimming down the interpreter?

Edit: Now I'm a little confused. BCi.S at the top level isn't being used, is it? The actual interpreter is in rts/inter*?

The actual interpreter is in vm/capability.cc. Yeah, I'm going to do some spring clean-up, soon.

Most of porting the interpreter to assembly should be straightforward, except for the handling of overapplication and partial applications; the current handling of those is very complicated. I'm not particularly happy with the current design of the CALL* instructions, particularly the encoding of GC info. I'd also like to use LuaJIT's register-window call technique, but that requires a more sophisticated bytecode register allocator. I haven't spent much time on optimising the interpreter, because the instruction encoding was changing a bit too much. I'd also like to decouple the on-disk bytecode format from the runtime bytecode format. (For better load times one could always supply an install-time optimisation tool, as Android does.)
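(For anyone unfamiliar with the register-window technique: the callee's register file overlaps the caller's frame, so arguments never need copying. A purely illustrative sketch with made-up names, not lambdachine's actual frame layout:)

    // In a register-window call, slot A of the caller's frame holds the
    // callee and the arguments sit right after it, so the callee's
    // register 0 can simply alias the first argument slot:
    //
    //   caller: base[0] ... base[A]=callee base[A+1]=arg0 base[A+2]=arg1
    //   callee:                            r0             r1
    //
    typedef long Slot;

    Slot *calleeBase(Slot *callerBase, int A) {
      return callerBase + A + 1;   // no argument copying needed
    }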

That said, except for the CALL*/EVAL type instructions, the current format should stay stable. So please go ahead and start a branch. You may want to consider using LuaJIT's DynASM.

Is it really necessary to break out into assembler? I was able to get really good code from gcc and clang by using the labels-as-values extension and having some locals at the top of the block which I used freely between the labels (they ended up in registers).
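For reference, a minimal sketch of that labels-as-values dispatch style (a GNU extension supported by gcc and clang; the opcode set and encoding here are made up for illustration):

    #include <cstdio>

    enum Op { OP_PUSH, OP_ADD, OP_HALT };

    long run(const unsigned char *ip) {
      // Locals declared once at the top; the compiler can keep them in
      // registers across all handlers.
      long stack[64];
      long *sp = stack;
      static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };

      goto *dispatch[*ip++];      // initial dispatch
    op_push:
      *sp++ = *ip++;              // one operand byte follows the opcode
      goto *dispatch[*ip++];
    op_add:
      sp[-2] += sp[-1]; --sp;
      goto *dispatch[*ip++];
    op_halt:
      return sp[-1];
    }

    int main() {
      const unsigned char prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
      std::printf("%ld\n", run(prog));  // prints 5
      return 0;
    }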

Also, here's some pseudocode for a crude bump-pointer PAP thing I hacked together a long time ago, which might give you some ideas:
https://gist.github.com/kvanberendonck/8328504

One branch per partial application, but it makes a lot of assumptions. It probably doesn't help, or is too naive (or maybe it needs explaining), but it won't hurt anyone if I leave it here.

I may have invented something new by accident which could potentially drastically increase the performance of even LuaJIT's interpreter, by offloading the dispatch onto the CPU. I'll give you more info as it comes. The general idea is to produce a large buffer of opcode handler addresses ahead of time. By pointing rsp at this buffer, with each handler ending in a ret instruction, the CPU can predict each dispatch as a function return (and without using lea); conveniently, on x86, this also means rsp advances through the buffer automatically.

This may be superior to a call-splat approach (context threading), as it doesn't require another decoder window for the calls, as described in "Context Threading: A Flexible and Efficient Dispatch Technique for Virtual Machine Interpreters". It could potentially perform better than threaded interpreters because the CPU does a good job of predicting ret: it's predictable and doesn't rely on indirect-branch history.

I.e., a sketch:

opcodes:
    // decoded ahead of time: one handler address per instruction,
    // in execution order
    &foo
    &bar
    &foo
    &bar        // each ret pops the next entry (rsp moves upwards)

vm_exec:
    // decode the instructions and fill `opcodes` with handler
    // function pointers ahead of time, then point rsp at the start
    mov rsp, opcodes
    ret         // "returns" into the first handler

foo:
    // ... handler body ...
    ret         // pops the address of the next handler and jumps there

bar:
    // ... handler body ...
    ret

What happens if an interrupt occurs? The CPU will push the values of the registers onto the stack, which will overwrite your opcode chain. It was a nice idea, though. :)

LuaJIT's interpreter may not be as fast as context threading (though it probably is most of the time), but it's very flexible. You can change the semantics of an instruction just by swapping out the pointer to the code section that implements it. This is great for switching between normal mode and trace recording mode. Mike Pall wrote it in assembly to optimise for the CPU pipeline as much as possible.
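A sketch of that flexibility in C++ (all names made up; the real interpreter lives in vm/capability.cc and is not structured like this): one live dispatch table, plus a table per mode.

    typedef void (*Handler)(void);   // handler for one bytecode instruction

    enum { NUM_OPCODES = 128 };
    Handler dispatch[NUM_OPCODES];           // consulted on every instruction
    Handler normalHandlers[NUM_OPCODES];     // plain interpretation
    Handler recordingHandlers[NUM_OPCODES];  // interpret and record trace IR

    // Swapping the table changes the semantics of every instruction at
    // once, with no mode check left in the hot dispatch path.
    void setRecording(bool on) {
      const Handler *src = on ? recordingHandlers : normalHandlers;
      for (int i = 0; i < NUM_OPCODES; ++i)
        dispatch[i] = src[i];
    }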

Is it even allowed to do that? I would imagine it would break all kinds of stack optimizations, including the fancy stuff done for exception handling, which I've noticed MSVC emits using negative sp slots.

Oh well :(

OK, I'm not 100% sure. It might be that the CPU is guaranteed to switch to the thread's kernel stack first and push things there. In that case it would mainly confuse GDB and Valgrind.

I'll ask on Stack Overflow, because if it does work, I want to implement this myself.

I made a prototype implementation and benchmarked this about a month ago.

The CPU actually has a return-address cache, and performance drops drastically if you pull it out of sync (by changing rsp). rets are only fast if they hit that cache; otherwise the whole scheme is only as fast as using plain old jmps.

This is actually documented in the AMD developer manuals. Oh well.

I see that @kvanberendonck has fixed the tr1 issue with the build: thanks! The build now compiles all the way until it hits the need for integer-simple, which is my fault since I am using a stock GHC 7.8.3. I will fix that.

Build aside, I see a recurring pattern here: copying, pasting, and hacking the base libraries to make them suitable for lambdachine. This also happens in Haste and GHCJS, and it seems to be an anti-pattern caused by GHC's base being in need of refactoring (there is an effort under way for that: split-base). In an ideal world, I would expect lambdachine to simply re-use most of the base libraries as-is and plug in a few that are lambdachine-specific. What were the main reasons for reimplementing some of the base packages?

EDIT: By the way, would it make sense to create an IRC channel for more interactive discussion?

I modified base because I don't support some language features yet, specifically calling C functions, which is obviously used a lot throughout the base package. I'm adding that now, as I need it to support GMP.

I registered #lambdachine on Freenode. I don't check my IRC client too often, though, and my notification setup is currently broken. I'll see if I can fix it.

It may be worth your time talking to erikd (m3ga), because he ported GMP to pure Haskell. It's faster in some spots and only 50% or so slower in others, and it would be amazing to be able to see how it interacts with the tracing.

Well, the goal is to be able to build it with an out-of-the-box GHC, so you don't have to build a custom GHC version. I agree that it could be interesting to see how the JIT could make this fast in the common case, but the goal right now is to make it easier for contributors, and to make it more useful out of the box by supporting more of base and other libraries.

@nominolo I see re: language features not supported. I think any specific requirements and needs from lambdachine should feed back into the split-base effort. As you said, it would be great if this could be used with an out-of-the-box GHC.

I was also thinking about the extent to which lambdachine could be the foundation for a next-generation GHCi, one that is much faster than the current one and avoids compilation entirely, for faster development cycles.