ASSERT-KTH/spork

Applicability to other languages?

Opened this issue · 6 comments

Just curious: Are the methods that you're using with Spork applicable to other languages (with their own AST backends for GumTree, like srcML for C++)? Or is there something unique to Spoon and/or Java that enables what you're doing here?

Hi!

Short answer: the concept of structured merging is essentially applicable to anything that can be represented as a graph (for a programming language that'll typically be an AST, but it doesn't have to be). The general methods used in Spork are no exception.

Longer answer:

In principle, these techniques can be applied to any programming language, given a sufficiently powerful AST representation. Really, they can be applied to anything that can be represented as a tree, it doesn't have to be a programming language at all. The merge algorithm I've implemented (a modified version of 3DM (link)) is a generalized tree merging algorithm that just requires matchings between tree nodes to operate, and tree matching is a concept that generalizes to any form of tree.

The benefits from applying a structured merge may however differ between languages. For example, in this paper (link), the authors found that semistructured merge (which is similar to structured merge) was applicable to JavaScript, but the improvements over unstructured merge were much less pronounced than they were when using the same technique for Java. Java appears to benefit tremendously from semistructured and structured merge, possibly due to its rather inflexible syntax and clearly delimited syntactical levels.

That being said, there is structured merging for other languages. For example, GrammaTech (link) has a commercial tool that incorporates structured (AST-based) merge for JavaScript.

Spork is itself very much tied to Java, both because of using Spoon and because of incorporating some language semantics into the merge (e.g. methods being unordered).

Yeah, I had just checked out MergeResolver before this, it seems very interesting (especially the data mining part), but unfortunately it's proprietary and only does JavaScript.

Mainly, I was thinking that something like this would be very useful for downstream versions of large projects, like https://github.com/Eloston/ungoogled-chromium , which have a large amount of patches that have to be updated and merged in every new version. That's mainly why I was curious about other languages.

It's good to know that the general method you're using here is adaptable. But I'd probably want to start with another project from scratch for other languages instead of modifying Spork though, right?

I had been thinking a lot about the potential of AST merging when I stumbled across this, it looks really cool.

But I'd probably want to start with another project from scratch for other languages instead of modifying Spork though, right?

Probably, at least mostly. Much of Spork relies heavily on Spoon, which for obvious reasons locks it to Java. I just didn't have the time to make it all entirely generic. The implementation of 3DM in Spork (in the base3dm subpackage) is however very generic, and does not rely on any particular AST. So if you implement your merge in Java or Kotlin, then you can probably just copy/paste that package.

My master thesis contains a ton of detailed information, both theory and practice. Most of the information is not Java-dependent, so much of it should be of use to you. There are also some important shortcomings of the core 3DM merge algorithm that one must be aware of, such that it can't detect delete/edit conflicts, and that move conflicts are really tricky to deal with. I haven't come up with workarounds for all problems yet, but I'm hoping to keep iterating on this until I have.

If you send me an email at slarse@slar.se, I can send you a copy of the finished thesis. It's not yet published, and I've been told it can take a while before that happens, but I'm free to share it as I wish until it is.

For anyone else interested, the thesis is temporarily available here until it's officially published.

Thank you for posting this! This is extremely informative and helpful for me.

One thing that I had noticed isn't mentioned here are the motivations for choosing Spoon rather than the other AST backends that GumTree offers for Java (e.g. JDT, JavaParser, srcML). Was there a technical reason for choosing Spoon over the others? I was curious because you had mentioned a "sufficiently powerful AST representation" being a prerequisite, and I was wondering if Spoon offered something here that the others didn't.

Spoon was mostly chosen out of convenience. The group I did my thesis with have significant expertise in Spoon, and the gumtree-spoon-ast-diff library that's used as a "driver" for GumTree is more finely tuned for Java than vanilla GumTree is. Spoon's pretty-printer was also fairly easy to work with (see notes on pretty-printers below). I also found the JavaParser driver for GumTree to not work too well, it exhibited crashes quite frequently in my initial testing. But mostly, the fact that Spoon worked and I could get extensive support on using it from experts were the deciding factors, as I was really skirting the edges of my own expertise with this thesis.

Spoon is actually built on top of JDT. JDT itself isn't super easy to work with for source code transformations, Spoon is just a whole lot easier to use. From my cursory exploration of JavaParser, it seems like it should be viable as well, but it's really hard to tell before actually trying it out. Essentially, with "sufficiently powerful", I mean:

  • It must be possible to cobble together a merged AST from three different AST revisions. The merged AST that Spork produces is really something of a Frankenstein's monster, but it works :)
  • It must be possible to infer the role (i.e. parameter, local variable, type, etc) of a node in the merged AST. As GumTree can detect moves, the role of a node may change from its source to its destination, so it can sometimes be a bit fiddly.
  • It should preferably be possible to build the merged AST somewhat generically. You don't want to have to build the merged AST from scratch, as that essentially involves recreating large portions of a library like Spoon (thousands upon thousands of lines of code). It's a huge amount of work. The part of Spork that builds the Spoon tree, while not trivial, is only a couple of hundred lines of code. The reason it's so small is largely because of Spoon's CtElement.setValueByRole method, which makes it very easy to build a tree given that you know the role of each node.
  • It must be possible to pretty-print the merged AST. As the nodes are from different source files, printers that use the source position to make certain printing decisions may exhibit some strange behavior.
  • It must be possible to pretty-print conflicts. This can be difficult depending on how the AST's pretty-printer is built, and how the AST itself is built. A pretty-printer is not at all as hard to build as an AST-builder, but it's still something that you should avoid if possible.

Structured merge is pretty complex, and so you want to do as little of the busywork as possible and just focus on the actual merging. Really, a huge reason why I managed to create a functioning (and as per the evaluation, pretty good) structured merge tool in just a couple of months was that I could leave most of the work to other libraries.