boricj/ghidra-delinker-extension

Implementation question

Opened this issue · 6 comments

gynt commented

I am planning on using this!

I am wondering how you deal with the following scenario:

Two singleton C++ classes (A and B) reference each other's data inside functions.

// File: A.hpp
class A {
public:
   int32_t number = 0;
   int32_t touchB();
};

// File: B.hpp
class B {
public:
   int32_t number = 0;
   int32_t touchA();
};

// File: main.cpp
A a;
B b;

int32_t A::touchB() {
  this->number = this->number + b.number;
  return this->number;
}

int32_t B::touchA() {
  this->number = this->number + a.number;
  return this->number;
}

In optimized compiled code, a and b may be placed next to each other in memory, and therefore the reference to b.number in machine code might look something like:

mov eax, dword ptr [ ecx + 0x04 ] ; Assuming (the structs of) a and b are exactly next to each other in memory

or like

mov eax, dword ptr [ 0x1000004 ] ; Assuming (the structs of) a and b are exactly next to each other in memory and the struct of a lives at 0x1000000

How does this program delink this? I am guessing that it puts these functions in the same object file, especially in the latter example.
Because how would it know these came from different files? (and therefore different obj files).

Of course I can try this out for myself, but I figured I'd ask before I embark down this rabbit hole! It will set my expectations.

Hi @gynt,

I've never attempted to delink C++ code before. I've given it some thought and did some tests on my side, so here's my brain dump on this. I tried to keep this accessible without going into too many details, which is why there's a TL;DR at the end if you just want a "what would it take?" answer.

That being said, I'll preemptively answer these two questions first:

How does this program delink this?

Short answer: it's transparent to the end-user of the extension.

Long answer: the relocation synthesizer analyzer leverages the symbols, references and data types present inside a Ghidra database to identify and undo relocation spots. Its results can be audited using Window > Relocation table (synthesized). Parts 7 to 10 of my series of articles on reverse-engineering break down an example by hand in a manner similar to how my extension works.
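To make "undoing a relocation spot" concrete, here is a minimal sketch of reversing one absolute 32-bit relocation by hand. It is not the extension's code: the helper names are hypothetical, and it assumes i386 ELF conventions, where an R_386_32 relocation computes S + A (symbol address plus addend) and the addend is stored in place.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers (not from the extension): little-endian 32-bit access.
uint32_t readU32(const std::vector<uint8_t>& bytes, size_t offset) {
    return uint32_t(bytes[offset]) |
           (uint32_t(bytes[offset + 1]) << 8) |
           (uint32_t(bytes[offset + 2]) << 16) |
           (uint32_t(bytes[offset + 3]) << 24);
}

void writeU32(std::vector<uint8_t>& bytes, size_t offset, uint32_t value) {
    for (size_t i = 0; i < 4; ++i) {
        bytes[offset + i] = uint8_t(value >> (8 * i));
    }
}

// Undo an absolute 32-bit relocation: the linker stored S + A, so the
// addend is recovered as stored - S and written back in place.
// `symbolAddress` is the address of the symbol the reference is attributed to.
uint32_t unapplyAbsolute32(std::vector<uint8_t>& bytes, size_t offset,
                           uint32_t symbolAddress) {
    uint32_t stored = readU32(bytes, offset);
    uint32_t addend = stored - symbolAddress;
    writeU32(bytes, offset, addend);
    return addend;
}
```

With the bytes A1 04 00 00 01 (mov eax, dword ptr [0x1000004]) and the operand at offset 1, attributing the reference to b at 0x1000004 yields an addend of 0, while attributing it to a at 0x1000000 yields an addend of 4. Both are valid relocations; which one is right depends on the references recorded in the Ghidra database.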

Because how would it know these came from different files?

It doesn't. You can export any subset of a program at the symbol granularity as an object file, regardless of how the program was structured initially. It's just section bytes, symbols and relocations at this point.

Put another way: as long as you don't cut across a variable or function, you could theoretically cut up a program into any arbitrary shape you want and this extension should generate a set of object files that, if all linked together, should produce a working program that is functionally identical to the original one.
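The "don't cut across a variable or function" rule can be sketched as a simple range check. This is only an illustration under stated assumptions: the Range and Symbol types are hypothetical, ranges are half-open, and the selection is a single contiguous range (a real selection may be a set of ranges).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical types for illustration; ranges are half-open [start, end).
struct Range {
    uint32_t start;
    uint32_t end;
};

struct Symbol {
    std::string name;
    Range extent;
};

// A selection cuts across a symbol when it overlaps the symbol's bytes
// without containing all of them.
bool cutsAcross(const Range& selection, const Symbol& symbol) {
    bool overlaps = selection.start < symbol.extent.end &&
                    symbol.extent.start < selection.end;
    bool contains = selection.start <= symbol.extent.start &&
                    symbol.extent.end <= selection.end;
    return overlaps && !contains;
}

// Collect the names of every symbol the selection would split.
std::vector<std::string> findCutSymbols(const Range& selection,
                                        const std::vector<Symbol>& symbols) {
    std::vector<std::string> cut;
    for (const Symbol& symbol : symbols) {
        if (cutsAcross(selection, symbol)) {
            cut.push_back(symbol.name);
        }
    }
    return cut;
}
```

With a at [0x1000000, 0x1000004) and b at [0x1000004, 0x1000008), selecting only a's four bytes is a valid cut even though the two objects are adjacent, whereas a selection starting at 0x1000002 would split a.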

C++ and delinking in theory

My whole delinking shtick relies on the fact that traditional linkers work on object files and are language agnostic. From their point of view, an object file built using a C++ compiler is no different than an object file built using a C compiler or the output produced by an assembler. Within that assumption, delinking C++ (or any language following the traditional compile-assemble-link toolchain flow) should be theoretically possible... but there are some ABI concerns nevertheless to keep in mind.

Symbol name mangling

C++ mangles symbol names. That's not a direct concern for delinking, but the symbol names do need to be correct for subsequent linking to work and my extension currently doesn't have any specific support for that. You could mangle the symbol names by hand inside Ghidra, but a saner option would be to add mangling options to the object file exporter. There might be some corner cases to account for (like handling extern "C"), but it should be fairly straightforward to implement.
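For a feel of what mangling by hand involves, here is a minimal sketch of the Itanium C++ ABI scheme used by gcc, restricted to the simplest case: a non-template function taking no parameters, optionally nested in classes or namespaces. The function name is hypothetical, and qualifiers, parameter types, substitutions and the MSVC scheme are deliberately out of scope.

```cpp
#include <string>
#include <vector>

// Mangle a parameterless, non-template function given its qualified name
// split into components, e.g. {"A", "touchB"} for A::touchB().
std::string mangleNoArgFunction(const std::vector<std::string>& qualifiedName) {
    std::string mangled = "_Z";
    bool nested = qualifiedName.size() > 1;
    if (nested) {
        mangled += "N";  // opens a nested name
    }
    for (const std::string& part : qualifiedName) {
        // Each component is emitted as <length><identifier>.
        mangled += std::to_string(part.size()) + part;
    }
    if (nested) {
        mangled += "E";  // closes the nested name
    }
    mangled += "v";  // empty parameter list is encoded as void
    return mangled;
}
```

Passing {"A", "touchB"} produces _ZN1A6touchBEv. Anything beyond this trivial case (const member functions, arguments, templates) needs considerably more rules, which is why a script or exporter option is preferable to renaming symbols manually.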

C++-specific data

While traditional linkers are blind to C++'s considerations, compilers do emit C++-specific data like RTTI or unwind tables. If you try to delink code using these features, that data must be exported alongside the rest of the object file for it to work correctly. If annotated properly, that data should be theoretically delinkable like any other piece of data (hopefully).

C++ and delinking in practice

The test case

cpp-test-case.tgz

Rather than just blindly guessing, let's put my extension to the test with a very basic C++ program, which should return 2 as its status code:

#include <stdint.h>

class A
{
public:
  int32_t number = 0;

  int32_t touchB();
};

class B
{
public:
  int32_t number = 1;

  int32_t touchA();
};

A a;
B b;

int32_t A::touchB()
{
  this->number = this->number + b.number;
  return this->number;
}

int32_t B::touchA()
{
  this->number = this->number + a.number;
  return this->number;
}

int main()
{
  a.touchB();
  b.touchA();

  return b.number;
}

I've done this exercise with both -Og and -O3:

  • Build it with i686-linux-gnu-g++ with -fno-pic and -no-pie (no need to add GOTs and PLTs to the mix).
  • Import the ELF program into Ghidra as-is, without touching up the Ghidra project in any way.
  • Identify the subset of addresses that come from the object file and create a program tree to keep track of them (for convenience).
  • Delink that subset into a .delinked.o ELF object file.
  • Link that exported object file into a .delinked.elf ELF program.
  • Run both readelf and i686-linux-gnu-objdump on every artifact for some human-readable output.

I've attached an archive with all those files for reference.

The results

  • The relinked programs do work exactly like the originals.
  • main has the correct symbol name (C linkage), but both A::touchB() and B::touchA() aren't mangled. It works out in this specific case because the object file is self-consistent.
  • The relocation table synthesizer analyzer failed to identify the relocations within .eh_frame for some reason and therefore these sections were not delocated properly.
  • ld wasn't happy with the corrupted .eh_frame sections of the delinked object files: ld: error in main.O3.delinked.o(.eh_frame); no .eh_frame_hdr table will be created.

Overall there are things to fix (symbol name mangling, missed .eh_frame relocations), but it did work out on this test case. I can make the following observations:

  • To the extent that C++ code is assembled into machine code, it doesn't appear significantly different than C code in this regard.
  • Due to lack of symbol name mangling, C++ symbol references that cross an object file boundary will not be found by the linker.
  • Due to improper delocation of .eh_frame, I do not expect that stack unwinding will work should an exception be thrown due to the corrupted unwind tables.
  • No data on RTTI.
  • No data on global constructors/destructors.
  • Since ld did mind the corrupted .eh_frame sections, it is not strictly a traditional, C++-oblivious linker as assumed by my delinking technique.

TL;DR

It sorta works in the current state with a bunch of pitfalls. There are things I didn't test, but with some fixes/improvements I think delinking C++ code can be made to work well enough to be useable in practice.

I did assume that you want to delink C++ programs back into object files for the same toolchain/platform, so no crazy Linux-to-Windows or PlayStation-to-Linux chimeras like I do. This side-steps a whole bunch of cross-platform ABI compatibility issues that are too scary to contemplate for C++. Also, looking at your GitHub profile I can probably bet that you want to delink Windows executables built using the MSVC toolchain. You won't be able to cheat with MinGW like I did once since these two toolchains are reportedly compatible only at the C interface level.

Therefore, you'll probably need a COFF object file exporter in order to produce object files that MSVC can grok. I only have an ELF object file exporter at the moment, but my data model and analyzers should be generic enough for COFF. A prototype could probably be banged out in a weekend binge, but object file exporters are very finicky to get just right and a fairly exhaustive regression test suite is all but required to have any confidence in the results.

Also, I only have code analyzers for i386 and 32-bit MIPS. CISC architectures are fairly easy to analyze so adding x86_64 support should be fairly easy. RISC architectures on the other hand... Let's just say I'm at my fifth attempt for MIPS and it's still wonky.

Post-scriptum

Sorry for the huge wall of text. I've found that delinking is an esoteric topic that requires paying attention to a lot of very fine details in order to work. I've automated it down to a couple of clicks with my extension in practice, but unfortunately there are no such shortcuts available for theory.

I should probably write a book at some point because there are hardly any resources about delinking out there, let alone an authoritative source I could cite for brevity's sake. At the very least, it might make for very scary bedtime reading for linker developers.

gynt commented

If you would write a book on this topic I would read it!
Thanks for the very clear explanation.

Therefore, you'll probably need a COFF object file exporter in order to produce object files that MSVC can grok.

You are correct. I am trying to use this on an i386 Windows PE binary from twenty years ago. I found objconv, which can allegedly translate ELF into COFF; I haven't tried it yet though.

Due to lack of symbol name mangling, C++ symbol references that cross an object file boundary will not be found by the linker.

The binary I want to use this on is 99% C++ member functions for C++ static singleton variables that are statically constructed before main() is run. So I kinda need the symbol name mangling, or I need to write my own inline assembly code to link to the object file, which isn't going to be pretty (but it basically is just a mov ecx, pointerToThis; call func;). In Ghidra, the member functions can be identified by the fact they have thiscall calling convention.

Due to improper delocation of .eh_frame, I do not expect that stack unwinding will work should an exception be thrown due to the corrupted unwind tables.

I don't think I care about eh_frame in my use case of this ghidra extension

No data on RTTI.

Makes sense because there weren't any virtual functions in your example. So no dynamic casting happens at runtime.
Less than 1% of my binary uses this, if at all. So this is no problem for me.

My binary consists mostly of C style things (const char *), and almost no C++ library things (std::string).

No data on global constructors/destructors.

Do you mean you didn't have any of that info in the original compiled program? I guess because a and b are not declared static.

Links I found useful

On ignoring eh frames: https://stackoverflow.com/questions/26300819/why-gcc-compiled-c-program-needs-eh-frame-section
Global constructors https://stackoverflow.com/questions/1271248/c-when-and-how-are-c-global-static-constructors-called
https://www.nsnam.org/docs/linker-problems.pdf
Name mangling https://web.mit.edu/tibbetts/Public/inside-c/www/mangling.html

I should clarify that while I know enough about ELF and Linux to pull off this dark magic, this doesn't apply to COFF and Windows. So all my answers are implicitly prefixed by "hopefully COFF and MSVC don't do something completely different than ELF and gcc".

Therefore, you'll probably need a COFF object file exporter in order to produce object files that MSVC can grok.

You are correct. I am trying to use this on an i386 Windows PE binary from twenty years ago. I found objconv, which can allegedly translate ELF into COFF; I haven't tried it yet though.

Old toolchains were a lot dumber than what we have today. It's unlikely the linker did something smart that causes a migraine... but it's possible it did something stupid instead.

That being said, old artifacts are mostly good news for delinking. No section garbage collection and no link-time optimizations means programs tend to be fairly straightforward in their layout. You might even be able to make decent guesses where the original boundaries of the object files were.

The binary I want to use this on is 99% C++ member functions for C++ static singleton variables that are statically constructed before main() is run. So I kinda need the symbol name mangling, or I need to write my own inline assembly code to link to the object file, which isn't going to be pretty (but it basically is just a mov ecx, pointerToThis; call func;). In Ghidra, the member functions can be identified by the fact they have thiscall calling convention.

Since Ghidra can have multiple labels for a given address (with one designated as the primary label), the simplest option for symbol name mangling would be to put an option to prefer a mangling scheme in the exporter.

In the test case, for one of the methods Ghidra created both the primary label touchB() (within the namespace A) and _ZN1A6touchBEv (within the global namespace). Currently the exporter will only consider the primary label A::touchB(), but if we tell it to prefer "Itanium C++ name mangling" if available it would pick up the _ZN1A6touchBEv label instead. It would then be the end-user's responsibility to ensure mangled labels are provided, as the exporter would fall back to the primary label otherwise (useful for extern "C").
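The proposed preference could look roughly like the sketch below. It is not the exporter's actual logic or Ghidra's API: the Label type and function names are hypothetical, and "looks mangled" is approximated by the _Z prefix that Itanium-mangled names start with.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the labels Ghidra holds for one address.
struct Label {
    std::string name;
    bool primary;
};

// Itanium-mangled names start with "_Z"; this is a heuristic, not a parser.
bool looksItaniumMangled(const std::string& name) {
    return name.rfind("_Z", 0) == 0;
}

// Pick the symbol name to export: a mangled label if one is preferred and
// available, otherwise the primary label.
std::string pickExportName(const std::vector<Label>& labels, bool preferItanium) {
    std::string primaryName;
    for (const Label& label : labels) {
        if (preferItanium && looksItaniumMangled(label.name)) {
            return label.name;
        }
        if (label.primary) {
            primaryName = label.name;
        }
    }
    return primaryName;
}
```

For an address carrying both A::touchB (primary) and _ZN1A6touchBEv, the preference selects the mangled label. For main, which only has its primary label, the fallback keeps the unmangled name.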

Here, Ghidra picked up the mangled names from the .symtab symbol table so it's "free" in this case. If the program is stripped however... I don't have a good answer at the moment. It might be possible to write a script that generates mangled labels, with tricks like Hungarian notation to encode information that Ghidra's database doesn't model, like type qualifiers.

I don't think I care about eh_frame in my use case of this ghidra extension

If you don't care about RTTI, unwind tables, SEH and whatever else Windows does differently, you could just delink a C++ program as if it was a C program. It will probably work as long as the delinked code doesn't try to use these features. If it tries however, you'll have some very exotic undefined behavior on your hands.

Just for reference, the relocations inside .eh_frame were missed because I didn't write a PC-relative data relocation synthesizer yet. It would be similar to the existing absolute data relocation synthesizer. There are additional concerns I won't get into (section-relative relocation) but with some luck it might just work as-is, once that relocation synthesizer is written.
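As a sketch of what such a PC-relative synthesizer would have to compute (this is not code from the extension, and the function names are hypothetical): on i386 ELF, R_386_PC32 stores S + A - P, where S is the symbol address, A the addend and P the address of the relocated field itself. Recovering the addend is therefore a matter of reversing that formula.

```cpp
#include <cstdint>

// Value a linker would store for a 32-bit PC-relative relocation: S + A - P.
// Arithmetic is modulo 2^32, matching the 32-bit field.
uint32_t applyPcRelative32(uint32_t symbolAddress, uint32_t addend,
                           uint32_t placeAddress) {
    return symbolAddress + addend - placeAddress;
}

// Recover the addend from the stored value: A = stored - S + P.
uint32_t pcRelativeAddend32(uint32_t stored, uint32_t symbolAddress,
                            uint32_t placeAddress) {
    return stored - symbolAddress + placeAddress;
}
```

For instance, a field at 0x2000 that points back to a function at 0x1000 holds 0xFFFFF000, and reversing the formula gives an addend of 0. The difference from the absolute case is that the address of the field itself enters the computation, so the synthesizer has to know where each candidate field lives.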

My binary consists mostly of C style things (const char *), and almost no C++ library things (std::string).

Hopefully this means you mostly have "C with classes" instead of idiosyncratic C++. Probably good news for delinking.

No data on global constructors/destructors.

Do you mean you didn't have any of that info in the original compiled program? I guess because a and b are not declared static.

This test program doesn't have any global constructors/destructors. They should be no different than any other C++-generated data, so if I were to include the necessary bits of .ctors/.dtors in the export, hopefully global constructors/destructors will just work out.
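For contrast, the following hypothetical variation is the kind of source that typically does need a global constructor: a global object whose constructor has side effects that must run before main(). Where the compiler records the pointer to the generated initializer (.ctors, .init_array or something else) depends on the toolchain, and an optimizing compiler may fold simple cases into static data, so treat this as an illustration only.

```cpp
#include <cstdint>

// Counts how many Counter objects have been constructed so far.
int constructedCount = 0;

class Counter
{
public:
  int32_t number;

  // A constructor with a side effect on another global: this generally
  // cannot be expressed as plain initialized data.
  Counter() : number(42)
  {
    ++constructedCount;
  }
};

// Constructing this global is the work a startup initializer has to do.
Counter counter;
```

If the delinked object file contains counter but not the initializer entry that constructs it, the relinked program would start with counter.number and constructedCount both left at 0.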

Overall I think your use-case is doable: my extension is missing a COFF object file exporter and some minor symbol name handling improvements, but my data model and my analyzers (the really tricky parts) should work out of the box. You might want to play a bit with the existing ELF support first and follow along the articles in my blog to get a feel for the workflow.

So I've investigated this a bit on my side on Linux and I've identified a tricky source of problems for C++: section groups, known as COMDAT in Microsoft land.

This covers stuff like vtables, typeinfos, implicit/default constructors/destructors, inline functions, implicit template instantiations... As far as I can tell, these bits can be delinked like any other code or data. They probably won't be a problem during object file exportation as long as they are external references: I think the definitions could come from another object file without any issues, but I haven't actually tested that part.

However, if these bits are exported as part of an object file then it's another story. If these sections aren't handled specifically by the object file exporters, it will lead to multiple symbol definitions down the line during linking since these sections are supposed to be deduplicated. Hopefully most of it can be ignored with the external reference escape hatch mentioned above.
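The failure mode can be illustrated with a toy model of the linker's duplicate check. This is a deliberate simplification with hypothetical types, not how ld or link.exe is implemented: definitions in a section group are folded into one copy, while plain definitions of the same symbol in several object files are reported as errors.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one symbol definition in one object file.
struct Definition {
    std::string objectFile;
    std::string symbol;
    bool inComdatGroup;
};

// Return the symbols a linker would report as multiply defined in this toy
// model: more than one plain (non-COMDAT) definition of the same name.
std::vector<std::string> multiplyDefinedSymbols(
        const std::vector<Definition>& definitions) {
    std::map<std::string, int> plainDefinitions;
    for (const Definition& definition : definitions) {
        if (!definition.inComdatGroup) {
            ++plainDefinitions[definition.symbol];
        }
    }
    std::vector<std::string> errors;
    for (const auto& entry : plainDefinitions) {
        if (entry.second > 1) {
            errors.push_back(entry.first);
        }
    }
    return errors;
}
```

If two delinked object files both export a vtable such as _ZTV4Base as ordinary data, the model reports it as multiply defined. Marking both copies as belonging to a section group, or leaving the vtable as an external reference in one of them, makes the error go away.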

Another thing to keep in mind is C++ ABI compatibility. It's not too much of a problem on Linux as far as I know, but it appears Microsoft doesn't provide any guarantees there across MSVC versions, at least before Visual Studio 2015. You'll probably need to use the same toolchain used to build the original program when reusing its exported bits elsewhere.

In conclusion, I still think delinking C++ code is theoretically doable and my extension can probably handle it if it is suitably improved, but it's going to be trickier than just plain C code since the ABI surface is much larger. At any rate, the biggest blocker for your use-case is the COFF object file exporter. I might get around to doing all of that eventually, but I can't make any promises or give any timeline: if you want it anytime soon, you'll probably have to get your hands dirty.

@gynt FYI someone submitted a PR for a COFF object file exporter (#5), in case you're interested.

I'm coming back to this after getting everything working myself. I wanted to add some comments about global initializers, SEH, and resources in MSVC that seem relevant in case someone else comes across this.

Simply including the list of global initializer function pointers in the delink selection isn't enough for MSVC to relink them properly: they need to be in a section with a specific name to be incorporated into the CRT by link.exe. The following is an excerpt from https://github.com/widberg/fmtk/wiki/Decompilation#41-global-initializers

As described in CRT initialization, one can relink the global initializers with Microsoft link.exe specific tricks. The following C file, when compiled, produces an object file with a list of global initializers in the .CRT$XCU section. Maintaining the order of the entries is important. When linked, MSVCRT will incorporate these into the list of functions called in __cinit.

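/* Declare the CRT section that holds pointers to global initializers. */
#pragma section(".CRT$XCU", read)

/* For each initializer: declare the function, then place a pointer to it
   in .CRT$XCU. Keep the entries in their original order. */
#define X(x) \
    extern void x(void); \
    __declspec(allocate(".CRT$XCU")) void (*__xc_u_0_##x)(void) = x;

X(FUN_008e4690)
/* etc... */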

Regarding SEH, nothing special needs to be done: just make sure the pointer members of the record structs are marked as addresses in Ghidra and that they are included in the delinker selection. Relevant section: https://github.com/widberg/fmtk/wiki/Decompilation#42-tls-callbacks-structured-exception-handling-and-c-exceptions

Finally, the easiest way to handle the resources is to extract them with Resource Hacker and relink them. You might run into the same thing I did with Windows Side-by-Side where you need to delete/replace the manifest resource. Relevant section https://github.com/widberg/fmtk/wiki/Decompilation#43-resources

In general, the extension works great with C++ using the same toolchain as the original executable. I haven't gone too deep into replacing functions yet but keeping the symbol names consistent and cutting out the code I replace has been enough to keep me out of trouble so far.