
Intermediate Representation for Binary analysis and transformation

Primary language: C++. License: MIT.

GTIRB

The GrammaTech Intermediate Representation for Binaries (GTIRB) is a data structure for machine code analysis and rewriting. It is intended to facilitate the communication of binary IR between programs performing binary disassembly, analysis, transformation, and pretty printing. GTIRB is modeled on LLVM IR and seeks to serve a similar function: encouraging communication and interoperability between tools.

The remainder of this file describes the structure of GTIRB, how to build it, and how to use it.

Structure

GTIRB has the following structure:

    -------Aux Data
   /    /
  /    /   ----DataObject
IR    /   /----Section
  \  /   /-----Symbols
  Modules------SymbolicExpressions
         \-----ImageByteMap
          -----CFG
               / \
           Edges Blocks

IR

An instance of GTIRB may include multiple modules (Module), which represent loadable objects such as executables or libraries. Each module holds information such as symbols (Symbol), data (DataObject), and an inter-procedural control flow graph (CFG). The CFG consists of basic blocks (Block) and control flow edges between these blocks. Each datum and each block holds a range referring to the bytes in the ImageByteMap. Each symbol holds a pointer to the block or datum it references.

Instructions

GTIRB explicitly does NOT represent instructions or instruction semantics, but it does provide symbolic operand information and access to the bytes. There are many intermediate languages (ILs) for representing instruction semantics (e.g., BAP's BIL, Angr's Vex, or Ghidra's P-code). GTIRB works with these or any other IL by storing instructions generally and efficiently as raw machine-code bytes and separately storing the symbolic and control flow information. The popular Capstone decoder and Keystone encoder provide an excellent option for reading and writing instructions from and to GTIRB's machine-code byte representation without committing to any particular semantic IL. By supporting multiple ILs and storing analysis results separately in auxiliary data tables, GTIRB enables collaboration between independent binary analysis and rewriting teams and tools.

Auxiliary Data

GTIRB provides for the sharing of additional information, e.g. analysis results, in the form of AuxData objects. These can store maps and vectors of basic GTIRB types in a portable way. This repository will describe the anticipated structure for very common types of auxiliary data such as function boundary information, type information, or results of common analyses.

UUIDs

Every element of GTIRB (namely modules (Module), symbols (Symbol), blocks (Block), and instructions (InstructionRef)) has a universally unique identifier (UUID). UUIDs allow both first-class IR components and AuxData tables to reference elements of the IR.
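To make the role of identifiers concrete, here is a minimal sketch in plain C++ of a side table that refers to elements by identifier rather than by pointer. All names (DemoUUID, makeDemoUUID, DemoBlock, DemoNameTable, nameOf) are hypothetical and are not part of the GTIRB API; the sketch only illustrates the idea of keying auxiliary information on a 128-bit identifier.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <random>
#include <string>

// Hypothetical 128-bit identifier; not GTIRB's UUID type.
using DemoUUID = std::array<std::uint8_t, 16>;

// Builds a random (version 4) UUID from the given generator.
DemoUUID makeDemoUUID(std::mt19937_64& Rng) {
  DemoUUID Id{};
  for (std::size_t I = 0; I < Id.size(); I += 8) {
    std::uint64_t Word = Rng();
    for (std::size_t J = 0; J < 8; ++J)
      Id[I + J] = static_cast<std::uint8_t>(Word >> (8 * J));
  }
  Id[6] = static_cast<std::uint8_t>((Id[6] & 0x0F) | 0x40); // version 4
  Id[8] = static_cast<std::uint8_t>((Id[8] & 0x3F) | 0x80); // RFC 4122 variant
  return Id;
}

// Hypothetical IR element that carries its own identifier.
struct DemoBlock {
  DemoUUID Id;
  std::uint64_t Address;
};

// A side table keyed by identifier rather than by pointer.
using DemoNameTable = std::map<DemoUUID, std::string>;

// Returns the name recorded for B, or an empty string if there is none.
std::string nameOf(const DemoNameTable& Names, const DemoBlock& B) {
  auto It = Names.find(B.Id);
  return It == Names.end() ? std::string() : It->second;
}
```

Because the table stores identifiers and not addresses of in-memory objects, it could in principle be written out and read back without fixing up pointers, which is the property the paragraph above relies on.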

Building

GTIRB should build successfully as a 64-bit target with GCC, Clang, and Visual Studio compilers that support at least C++17. GTIRB uses CMake, which must be installed.

mkdir build
cd build
cmake ../path/to/gtirb
make -j
# Run the test suite.
./bin/TestGTIRB

The gtirb library will be located in lib/libgtirb.so in the build directory.

On Windows, you may need to explicitly specify the path to the vcpkg installation when executing CMake. By default, GTIRB looks in C:\vcpkg\scripts\buildsystems\vcpkg.cmake to find the CMake support for the toolchain, but a different path can be specified by passing

-DCMAKE_TOOLCHAIN_FILE="C:\path\to\vcpkg\scripts\buildsystems\vcpkg.cmake"

when executing the CMake command above.

Requirements

The GTIRB build process automatically downloads all external requirements during the build. However, to install GTIRB, the following requirements should be installed separately.

  • Protobuf version 3.1 (or later, once we disable warnings in the protobuf build).
  • Boost version 1.67.0 or later.

Usage

GTIRB is designed to be serialized using Google's protocol buffers (i.e., protobuf), enabling easy and efficient use from any programming language.

GTIRB may also be used as a C++ library implementing an efficient data structure suitable for use by binary analysis and rewriting applications.

Using Serialized GTIRB Data

The serialized protobuf data produced by GTIRB allows for exploration and manipulation in the language of your choice. The Google protocol buffers homepage lists the languages in which protocol buffers can be used directly; users of other languages can convert the protobuf-formatted data to JSON format and then use the JSON data in their applications. In the future we intend to define a standard JSON schema for GTIRB.

Directory gtirb/src/proto contains the protocol buffer message type definitions for GTIRB. You can inspect these .proto files to determine the structure of the various GTIRB message types. The top-level message type is IR.

For more details, see Using Serialized GTIRB Data.

Using the C++ Library

We have provided several C++ examples in directory gtirb/doc/examples. See the Examples tab for more information.

The remainder of this section provides examples walking through common tasks using the GTIRB C++ library API.

Populating the IR

GTIRB objects are created within a Context object. Freeing the Context will also destroy all the objects within it.

Context C;
IR& ir = *IR::Create(C);

Every IR holds a set of modules.

ir.addModule(Module::Create(C));
Module& module = ir.modules()[0];

Addresses are represented by a distinct type which can be explicitly converted to and from uint64_t.

Addr textSectionAddress(1328);
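To illustrate why a distinct address type is useful, here is a minimal sketch of a strongly typed address wrapper in plain C++. DemoAddr and regionEnd are hypothetical names; this is not GTIRB's Addr class and only mirrors the explicit-conversion behavior described above.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for an address type; not GTIRB's Addr.
class DemoAddr {
public:
  constexpr DemoAddr() = default;
  // explicit: a bare integer is never silently treated as an address.
  constexpr explicit DemoAddr(std::uint64_t X) : Value(X) {}
  // explicit: an address is never silently treated as an integer.
  constexpr explicit operator std::uint64_t() const { return Value; }

  // Offsetting an address by a byte count yields another address.
  friend constexpr DemoAddr operator+(DemoAddr A, std::uint64_t Offset) {
    return DemoAddr(A.Value + Offset);
  }
  friend constexpr bool operator==(DemoAddr A, DemoAddr B) {
    return A.Value == B.Value;
  }
  friend constexpr bool operator<(DemoAddr A, DemoAddr B) {
    return A.Value < B.Value;
  }

private:
  std::uint64_t Value = 0;
};

// The compiler rejects accidental integer-to-address conversions.
static_assert(!std::is_convertible<std::uint64_t, DemoAddr>::value,
              "integers must not convert implicitly");

// Returns the first address past a region of Size bytes starting at Start.
constexpr DemoAddr regionEnd(DemoAddr Start, std::uint64_t Size) {
  return Start + Size;
}
```

With such a type, a call like regionEnd(466, 1328) with swapped integer arguments fails to compile, whereas regionEnd(DemoAddr(1328), 466) is accepted.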

Create some sections:

module.addSection(Section::Create(C, ".text", textSectionAddress, 466));
module.addSection(
    Section::Create(C, ".data", textSectionAddress + 466, 2048));

Create some data objects. These only define the layout and do not directly store any data.

auto* data1 = DataObject::Create(C, Addr(2608), 6);
auto* data2 = DataObject::Create(C, Addr(2614), 2);
module.addData(data1);
module.addData(data2);

The actual data is stored in the module's ImageByteMap:

ImageByteMap& byteMap = module.getImageByteMap();
byteMap.setAddrMinMax({Addr(2608), Addr(2616)});
std::array<uint8_t, 8> bytes{1, 0, 2, 0, 115, 116, 114, 108};
byteMap.setData(Addr(2608), bytes);
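The split between layout and storage can be sketched in plain C++ as follows. DemoRange and DemoByteMap are hypothetical types, not GTIRB's DataObject or ImageByteMap; the sketch assumes a single contiguous byte buffer and shows how a layout-only record is resolved against it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical layout-only record: where the bytes live, not the bytes.
struct DemoRange {
  std::uint64_t Address;
  std::uint64_t Size;
};

// Hypothetical byte store covering [Base, Base + Bytes.size()).
struct DemoByteMap {
  std::uint64_t Base;
  std::vector<std::uint8_t> Bytes;

  // Returns a copy of the bytes for R, or nullopt if R is out of bounds.
  std::optional<std::vector<std::uint8_t>> read(DemoRange R) const {
    if (R.Address < Base)
      return std::nullopt;
    std::uint64_t Offset = R.Address - Base;
    if (Offset > Bytes.size() || R.Size > Bytes.size() - Offset)
      return std::nullopt;
    auto First = Bytes.begin() + static_cast<std::ptrdiff_t>(Offset);
    return std::vector<std::uint8_t>(
        First, First + static_cast<std::ptrdiff_t>(R.Size));
  }
};
```

Using the same numbers as the walkthrough, a store based at 2608 holding the eight bytes above would return the last two bytes for the range {2614, 2} and nothing for a range that extends past address 2616.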

Symbols associate a name with an object in the IR, such as a DataObject or Block. They can optionally store an address instead.

auto data = module.data();
module.addSymbol(Symbol::Create(C,
                                data1,      // referent
                                "data1",    // name
                                Symbol::StorageKind::Extern));
module.addSymbol(Symbol::Create(C, data2, "data2",
                                Symbol::StorageKind::Extern));

GTIRB can store multiple symbols with the same address or referent.

module.addSymbol(Symbol::Create(C, data2, "duplicate",
                                Symbol::StorageKind::Local));
module.addSymbol(Symbol::Create(C, Addr(2608), "alias"));

Basic blocks are stored in an interprocedural CFG. Like DataObjects, Blocks reference data in the ImageByteMap but do not directly hold any data themselves. GTIRB does not directly represent instructions.

auto& cfg = module.getCFG();
auto* b1 = emplaceBlock(cfg, C, Addr(466), 6);
auto* b2 = emplaceBlock(cfg, C, Addr(472), 8);

The CFG can be populated with edges to denote control flow, or edges can be omitted and the CFG used simply as a container for Blocks.

auto edge1 = add_edge(b1->getVertex(), b2->getVertex(), cfg).first;
auto edge2 = add_edge(b2->getVertex(), b1->getVertex(), cfg).first;

Edges can have boolean or numeric labels:

module.getCFG()[edge1] = true;
module.getCFG()[edge2] = 1;

Information on symbolic operands and data is indexed by address:

Symbol* dataSym = &*module.findSymbols(Addr(2614)).begin();
module.addSymbolicExpression(Addr(472), SymAddrConst{0, dataSym});

Finally, auxiliary data can be used to store additional data at the IR level. An AuxData object can store integers, strings, basic GTIRB types such as Addr and UUID, and tuples or containers over these types.

ir.addAuxData("addrTable", std::vector<Addr>({Addr(1), Addr(2), Addr(3)}));
ir.addAuxData("stringMap", std::map<std::string, std::string>(
                             {{"a", "str1"}, {"b", "str2"}}));

Querying the IR

Symbols can be looked up by address or name. Any number of symbols can share an address or name, so be prepared to deal with multiple results.

auto syms = module.findSymbols(Addr(2614));
auto it = syms.begin();
Symbol& sym1 = *it++;
assert(sym1.getName() == "data2");
assert((*it++).getName() == "duplicate");

auto& sym2 = *module.findSymbols("data1").begin();
assert(sym2.getAddress() == Addr(2608));

Use a symbol's referent (either a Block or a DataObject) to get more information about the object to which the symbol points.

DataObject* referent = sym1.getReferent<DataObject>();
assert(referent);
assert(referent->getAddress() == Addr(2614));
assert(referent->getSize() == 2);

Alternatively, DataObjects can be looked up by an address contained within the object. Any number of objects may overlap and contain an address, so be prepared to deal with multiple results.

auto objs = module.findData(Addr(2610));
assert(objs.size() == 1);
assert(objs.begin()->getAddress() == Addr(2608));
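The containment query described above can be sketched in plain C++ as a linear scan over half-open ranges. DemoObject and findContaining are hypothetical names and say nothing about how GTIRB implements findData; the sketch only shows why a single address can yield several results.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical object occupying [Address, Address + Size).
struct DemoObject {
  std::uint64_t Address;
  std::uint64_t Size;
};

// Returns every object whose half-open range contains Query,
// in the order the objects appear in Objects.
std::vector<DemoObject> findContaining(const std::vector<DemoObject>& Objects,
                                       std::uint64_t Query) {
  std::vector<DemoObject> Result;
  for (const DemoObject& O : Objects) {
    // Comparing the offset against Size avoids overflow in Address + Size.
    if (Query >= O.Address && Query - O.Address < O.Size)
      Result.push_back(O);
  }
  return Result;
}
```

If a third, overlapping object {2610, 8} existed alongside the two from the walkthrough, a query for address 2610 would return two objects, so callers should iterate over the result instead of assuming a single match.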

The CFG uses boost::graph. GTIRB also provides a convenience function for iterating over blocks:

for (const auto& b : blocks(cfg)) {
  std::cout << "Block: " << uint64_t(b.getAddress()) << ".."
            << uint64_t(addressLimit(b)) << "\n";
}

Blocks contain a vertex_descriptor which is used to look up corresponding information in the CFG:

auto [edgeDescriptor, exists] = edge(b1->getVertex(), b2->getVertex(), cfg);
assert(exists);

edge_descriptors can be used to look up labels and the source/target blocks:

auto edgeRange = edges(cfg);
for (auto it = edgeRange.first; it != edgeRange.second; it++) {
  auto e = *it;
  auto v1 = source(e, cfg);
  auto v2 = target(e, cfg);
  std::cout << "Edge: " << uint64_t(cfg[v1]->getAddress()) << " => "
            << uint64_t(cfg[v2]->getAddress());
  if (auto* b = std::get_if<bool>(&cfg[e])) {
    std::cout << ": " << *b;
  }
  std::cout << "\n";
}
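The std::get_if call in the loop above suggests that an edge label behaves like a std::variant. The following sketch assumes a label that holds either a bool or an unsigned integer; DemoLabel and describeLabel are hypothetical names, and the exact label type used by GTIRB may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical label type: either a boolean or a numeric label.
using DemoLabel = std::variant<bool, std::uint64_t>;

// Renders a label as text by testing each alternative in turn.
// std::get_if returns nullptr when the variant holds another alternative.
std::string describeLabel(const DemoLabel& L) {
  if (const bool* B = std::get_if<bool>(&L))
    return *B ? "bool:true" : "bool:false";
  if (const std::uint64_t* N = std::get_if<std::uint64_t>(&L))
    return "num:" + std::to_string(*N);
  return "unknown";
}
```

When constructing such a variant, passing a value of the exact alternative type (for example std::uint64_t{1} instead of a plain int literal) avoids ambiguity between the bool and integer alternatives.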

Auxiliary data must be resolved to the correct type with the get() method before use. get() returns null if the wrong type is requested.

auto addrTable = ir.getAuxData("addrTable")->get<std::vector<Addr>>();
for (auto addr : *addrTable) {
  std::cout << "Addr: " << uint64_t(addr) << "\n";
}

auto* stringMap =
    ir.getAuxData("stringMap")->get<std::map<std::string, std::string>>();
for (auto p : *stringMap) {
  std::cout << p.first << " => " << p.second << "\n";
}
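The "null on wrong type" behavior can be sketched in plain C++ with std::any. DemoAuxTable is a hypothetical class and not GTIRB's AuxData; the sketch only shows why callers should check the returned pointer before dereferencing it.

```cpp
#include <any>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type-erased table store; not GTIRB's AuxData class.
class DemoAuxTable {
public:
  // Stores Value under Name, replacing any previous entry.
  template <typename T> void add(const std::string& Name, T Value) {
    Tables[Name] = std::move(Value);
  }

  // Returns a pointer to the stored value, or nullptr if Name is missing
  // or the stored value is not exactly of type T.
  template <typename T> const T* get(const std::string& Name) const {
    auto It = Tables.find(Name);
    if (It == Tables.end())
      return nullptr;
    return std::any_cast<T>(&It->second);
  }

private:
  std::map<std::string, std::any> Tables;
};
```

In this sketch, requesting std::vector<int> for a table stored as std::vector<std::uint64_t> yields nullptr, so a guard such as `if (auto* t = aux.get<...>(name))` is the safer access pattern.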

Serialization

Serialize IR to a file:

std::ofstream out("path/to/file", std::ios::binary);
ir.save(out);

Deserialize from a file:

std::ifstream in("path/to/file", std::ios::binary);
IR& newIR = *IR::load(C, in);