lac-dcc/honey-potion

Potential collaboration

Closed this issue · 5 comments

First, congrats on the release!
I'm working on an Elixir project for MLIR. At this point it can generate and run LLVM IR with no problem. I'm curious whether the Honey Potion team would be interested in moving the codegen from C source to LLVM/MLIR. By the way, exporting C source from MLIR is also supported via this
Here are some examples of what it looks like: https://gist.github.com/jackalcooper/223eafe3f592c034d7eaa9fffdbcf44d

Hi Shenghang Tsai. The code generator looks great! But we would have to discuss that more: what would be the advantage of producing MLIR? One of the reasons we chose C was that it was very easy to link the programs that we produce with the eBPF support library. There are a few other advantages in using C, though. An important point is debugging: it is easier to read the C code than it would be to read LLVM IR or MLIR, I suppose.

> But we would have to discuss that more: what would be the advantage of producing MLIR?

I think there are mainly these advantages:

  • Maintainability. Generating IR is usually more maintainable than generating source, since IR is more structured and formalized. You can also verify the generated IR to produce more informative error messages, etc.
  • Writing passes to transform or analyze the generated code. For instance, I noticed you have some analysis of the code size?
  • Different levels of representation of the program. Before producing the final target (C/LLVM), you can introduce different levels of IR for different levels of abstraction (MLIR's core idea: multi-level IR).
  • Everything in Elixir, including codegen and higher-level features.

> producing MLIR?
> we chose C was that it was very easy to link the programs that we produce with the eBPF support library

The final target is LLVM or C generated by MLIR. MLIR is part of LLVM, so all LLVM components are available for integration. It is possible to generate a shared library, an object file, or C source, or to run the code as a JIT-compiled function.

> It is easier to read the C code than it would be to read LLVM IR or MLIR

  • That is true. But it is worth trying, I guess.

Hi Shenghang Tsai,

> First, congrats on the release!

Thank you! And thanks for your interest in a collaboration! 😄

> Maintainability. Generating IR is usually more maintainable than generating source, since IR is more structured and formalized. You can also verify the generated IR to produce more informative error messages, etc.

I tend to agree with you that IR is more maintainable. A challenge that we are facing here relates to the semantic gap between Elixir, C and eBPF bytecode. Although we know what an Elixir construct should look like in bytecode in order to pass the verifier, writing that in C is challenging: even if the C code is valid (i.e. it should pass the verifier), the eBPF code generated after all of the optimizations and transformations might still fail. Usually, we have to try many semantically equivalent C variants to get the bytecode right.
I believe this trial-and-error approach is common when writing C code for eBPF. But I feel that in our case the problem is aggravated, since we are first translating Elixir to C and constantly updating the translator. Every new feature we add might trigger a new set of optimizations that generate invalid bytecode (i.e. bytecode that fails the verifier). When this happens, we have to refactor the translation of multiple constructs for them to work together again.

Maybe if we translate Elixir to the right IR, the semantic gap would be reduced. We could then design optimizations that better preserve the memory checks written by us, increasing the chances of the verifier accepting the code and reducing the problem of constantly refactoring the translator.
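
To illustrate the kind of memory check discussed above, here is a minimal host-side sketch in plain C. It is not taken from Honey Potion: the helper name `read_byte_checked` and its signature are hypothetical. It only shows the general pattern of proving that an access is in bounds immediately before performing it, which is the kind of check the eBPF verifier needs to see in the final bytecode. Whether a given optimization keeps the check in a form the verifier accepts is exactly the uncertainty described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not Honey Potion code).
 * Reads one byte at `offset` from the region [data, data_end).
 * Returns 0 and stores the byte in *out on success, or -1 if the
 * offset lies outside the region. The caller must pass data <= data_end. */
static int read_byte_checked(const uint8_t *data, const uint8_t *data_end,
                             size_t offset, uint8_t *out)
{
    /* The bounds check sits directly before the access it guards. */
    if (offset >= (size_t)(data_end - data))
        return -1;

    *out = data[offset];
    return 0;
}
```

A code generator that emits the check and the access as one unit, as in this sketch, at least keeps the two together in the source it produces; what the compiler does with them afterwards is a separate question.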

That said, there is much work already invested into the C backend, and we have been working for the past months on allowing the use of network-related program types. It would be a pity to throw all of this away right now.
Nonetheless, after this milestone, I believe it can be a good idea to run some experiments on the benefits of using MLIR on this project.

> we have to try many semantically equivalent C variants to get the bytecode right.
> When this happens, we have to refactor the translation of multiple constructs for them to work together again

That's true. Generating source usually means lots of leaky abstractions: there are conventions and boilerplate at every level, but they are not well defined or testable.

> between Elixir, C and eBPF bytecode

If we can generate LLVM IR (maybe with eBPF intrinsics), there are already tools to translate it to eBPF bytecode, right?

> If we can generate LLVM IR (maybe with eBPF intrinsics), there are already tools to translate it to eBPF bytecode, right?

Yes. In fact, the most common way to compile C to eBPF is by using clang, which will first translate the source to LLVM IR.
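
As a rough sketch of that pipeline, the tiny C file below can be lowered to LLVM IR and then to a BPF object file. The function `add_one` and the file names `prog.c`, `prog.ll` and `prog.o` are made up for illustration. The commands in the comment assume a clang/LLVM installation that includes the BPF target, and the result is not a loadable eBPF program (it has no section annotations or license), only a demonstration of the two lowering steps.

```c
/* prog.c: hypothetical example, not part of Honey Potion.
 *
 * Step 1, C to LLVM IR for the BPF target:
 *   clang -O2 -target bpf -emit-llvm -S prog.c -o prog.ll
 * Step 2, LLVM IR to a BPF object file:
 *   llc -march=bpf -filetype=obj prog.ll -o prog.o
 */
int add_one(int x)
{
    return x + 1;
}
```

A backend that emits LLVM IR directly would enter this pipeline at step 2 instead of step 1.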