build-bom

Dynamically discover the commands used to create a piece of software


Overview

This tool makes it easy to capture the steps taken during the build process of a software project. This can be very useful for:

  • Understanding build processes
  • Debugging obscure build problems
  • Applying static analysis and verification tools

In the first two cases, this tool gives a low-level view of exactly the set of files accessed by the build process (e.g., fully resolving all file includes and relative paths) in a way that is difficult to achieve by merely reading and understanding a build system. In a sense, it identifies the bill of materials for software.

In the third case, assurance tools generally require rebuilding programs in special modes or with alternative compilers (e.g., into LLVM bitcode for analysis or instrumentation). Doing so is typically labor intensive, as it requires extensive work to understand an existing build system, and more work still to modify it. This tool provides a way to apply analysis tools in a build-system-agnostic way.

This tool is primarily designed to help tame the myriad build systems of the C/C++ ecosystem, but it applies to any software project with a build step.

Usage

This tool wraps your normal build command and builds LLVM bitcode when possible. It arranges things such that any binary artifacts produced by your build system (e.g., object files, archives, shared libraries, or binaries) have their LLVM bitcode attached to them and accessible. The workflow of this tool proceeds in two phases:

  1. Building your software (using the generate-bitcode command wrapper)
  2. Extracting your bitcode (using the extract-bitcode command)

The tool also supports some auxiliary commands for generating traces of builds for visualization and understanding.

An example use of the tool is shown below for a make-based build:

$ build-bom generate-bitcode -- make
$ build-bom extract-bitcode /path/to/binary --output=/tmp/output.bc

In the first step, the tool acts as a wrapper around the real build system. It runs the build system and, if it observes any compilation commands, it runs an extra build of the source file using clang to generate bitcode. It attaches bitcode to object files, and then resumes the build.

In the next step, the tool extracts all of the accumulated bitcode.
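As a rough illustration of the first step, the sketch below derives a clang bitcode command from an observed compile command. It is a simplified assumption, not build-bom's actual rewriting logic: the `bitcode_command` helper and the `.bc` output naming are hypothetical, and real compile commands need far more care (multiple sources, response files, unsupported flags). The `-emit-llvm` flag is clang's standard switch for producing LLVM bitcode, and `-g` mirrors the default debug-information behavior.

```python
import os

def bitcode_command(argv, clang="clang", add_debug=True):
    """Derive a hypothetical clang invocation from an observed compile command."""
    out = [clang, "-emit-llvm"]
    if add_debug:
        out.append("-g")  # debug information is added by default
    args = iter(argv[1:])  # skip the original compiler name
    for arg in args:
        if arg == "-o":
            # Redirect the output so the real object file is not clobbered.
            obj = next(args)
            out += ["-o", os.path.splitext(obj)[0] + ".bc"]
        else:
            out.append(arg)
    return out

observed = ["gcc", "-O2", "-c", "foo.c", "-o", "foo.o"]
print(bitcode_command(observed))
```

The original command still runs unchanged; the derived command is an extra compilation whose output is later attached to the object file.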

Bitcode Generation Options

The generate-bitcode command has a number of options that may be useful in various contexts.

  • --bc-out: Directory to place LLVM bitcode (bc) output data
  • --clang: Specify the full name (or path if desired) of a clang binary to use
  • --objcopy: Specify the full name (or path if desired) of the objcopy binary to use
  • --suppress-automatic-debug: By default, build-bom automatically adds the -g flag when building bitcode to generate debug information; this flag inhibits this behavior
  • --strict: Generate bitcode that adheres strictly to the original compile command: keeps compilation arguments (e.g., optimization and architecture-specific flags) that are normally removed because they might be problematic for clang.
  • --inject-argument=STRING: Directs build-bom to inject an additional argument (STRING) into the command line for the command used to build bitcode (e.g., to configure the optimization level or level of debug information); can be specified multiple times
  • --remove-argument=REGEX: Directs build-bom to remove any argument matching the regular expression from the argument list when generating bitcode; can be specified multiple times
  • --preproc-native: Perform the preprocessing step using the compiler native to the build, and then compile the result with clang. This can be very useful for cross-compilers whose header files are incompatible with clang: the preprocessing step performs all of the #include and #define operations in the context of the cross-compiler, and the result should be C code that clang can then turn into LLVM bitcode.

Note that --suppress-automatic-debug could be useful in cases where the generated bitcode is disruptively large due to the presence of unneeded debug information. Since debug information is useful in most cases, however, it is generated by default.

The --remove-argument option can be used to remove arguments that inhibit analysis (e.g., -O3 may apply optimizations that are annoying for a static analysis, so it could be removed). The regular expressions are matched against each argument as seen by execve, so a conjoined flag like --foo=bar is a single argument that can be matched as a whole, while --foo bar appears as two separate entries in the argument list. Note that build-bom does not add anchors to the beginning (^) or end ($) of the regular expression it is given, so a regex can match anywhere within an argument; users will likely want to add anchors manually as needed.
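The matching rules can be illustrated with a minimal Python sketch. This is not build-bom's implementation: the `remove_matching` helper is hypothetical, and Python's `re.search` stands in for build-bom's regex engine on the assumption that both match anywhere in an argument unless anchored.

```python
import re

def remove_matching(argv, patterns):
    """Drop every argument that any pattern matches anywhere (unanchored)."""
    compiled = [re.compile(p) for p in patterns]
    return [arg for arg in argv if not any(c.search(arg) for c in compiled)]

# The argument list as execve would see it: "--foo=bar" is one entry,
# while "--foo bar" is two.
argv = ["gcc", "-g", "-fno-gnu-keywords", "--foo=bar", "--foo", "bar", "-c", "main.c"]

# Unanchored: "-g" also matches inside "-fno-gnu-keywords".
unanchored = remove_matching(argv, ["-g"])

# Anchored: only the exact "-g" argument is removed.
anchored = remove_matching(argv, ["^-g$"])

# A conjoined flag is matched as a whole; the split form is untouched.
conjoined = remove_matching(argv, ["^--foo=bar$"])

print(unanchored)
print(anchored)
print(conjoined)
```

In this sketch the unanchored pattern silently drops a second flag, which is the kind of surprise that explicit anchors avoid.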

Bitcode Extraction Options

The extract-bitcode command also provides options:

  • --llvm-link: Specify a name or path to the llvm-link binary; this is useful if LLVM commands are versioned on your system
  • --objcopy: Specify a name or path for the objcopy binary.

Design

The tool uses low-level operating system services to observe builds and record their actions. On Linux, it uses ptrace to observe every system call. When a source compilation command is observed, the tool generates the corresponding bitcode file using clang. It attaches the bitcode to the object file via a separate ELF section, allowing bitcode to be accumulated as a side effect of the build. At every stage, bitcode remains attached to build artifacts to ensure it is not lost.
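To make the ELF section mechanism concrete, the sketch below builds the kind of objcopy command lines that can attach a payload to an object file and later read it back. The `--add-section` and `--dump-section` flags are standard GNU objcopy options, but the section name, file names, and helper functions here are assumptions for illustration and may not match what build-bom does internally.

```python
SECTION = ".llvm_bitcode"  # hypothetical section name, chosen for illustration

def attach_cmd(obj, payload, objcopy="objcopy"):
    """argv that adds the file `payload` to `obj` as a new section, in place."""
    return [objcopy, "--add-section", f"{SECTION}={payload}", obj]

def dump_cmd(binary, out, scratch, objcopy="objcopy"):
    """argv that writes the section's contents to `out`.

    objcopy also writes a copy of `binary`; `scratch` names where that copy
    goes so the original file is left untouched.
    """
    return [objcopy, "--dump-section", f"{SECTION}={out}", binary, scratch]

print(attach_cmd("foo.o", "foo.bc.tar"))
print(dump_cmd("src/tar", "bitcode.tar", "/tmp/scratch-copy"))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a system with binutils installed; the sketch only constructs the commands so that it runs anywhere.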

There are four key observations enabling this approach to bitcode collection:

  1. Whenever we see the original build system compile a C/C++ file, we know we need to make the corresponding bitcode file
  2. We can attach arbitrary extra data (e.g., bitcode) to object files in extra ELF sections
  3. ELF sections containing data without special meaning are concatenated by the linker
  4. Standard tar files can be concatenated to produce a valid tar file that is the union of their contents

We wrap our generated bitcode in singleton tar files and allow the linker to accumulate them for us. When we want to collect aggregated bitcode for executable artifacts, we simply extract the tar file from their special LLVM bitcode ELF sections, extract the collected bitcode, and link it together with llvm-link.
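The tar observation can be checked with a short Python sketch using the standard `tarfile` module. The file names and contents are made up, and the byte-level concatenation merely imitates what the linker does when it merges sections. Because each archive ends with blocks of zeros, the reader must be told to skip them (`ignore_zeros=True` here, analogous to GNU tar's `--ignore-zeros`).

```python
import io
import tarfile

def singleton_tar(name, data):
    """Return the bytes of a tar archive holding exactly one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def members(blob):
    """List (name, contents) pairs from a possibly concatenated tar stream."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:", ignore_zeros=True) as tar:
        return [(m.name, tar.extractfile(m).read()) for m in tar.getmembers()]

a = singleton_tar("a.bc", b"bitcode for a.c")
b = singleton_tar("b.bc", b"bitcode for b.c")

# Plain byte concatenation, as a linker does with same-named sections.
combined = a + b
print(members(combined))
```

Reading the combined stream yields both members, so no archive-aware merging step is needed at link time.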

Figure: ./doc/build_bom_seq.svg

Observe as well that the build-bom process is useful for selective rebuilds: rebuilding only a portion of the sources will still have access to the LLVM bitcode ELF sections in object files from previous builds. The use of build-bom also degrades gracefully: object files that do not have LLVM bitcode sections in their ELF (i.e., those built separately without using build-bom) will simply not contribute to the ELF section/tar file accumulation of bitcode; the final llvm-link step during extraction does not need to be total and is tolerant of unresolved symbols.

The bitcode extracted will be representative of the binary code contained in the specified file. It will not necessarily be identical to that code due to strictness flags, differences between clang and the native build compiler, and a different linking step. What is extracted depends on the type of file:

  • Executable: bitcode for the entirety of the executable, including any static libraries the executable was linked with, but not including any shared libraries (even if they themselves were built with build-bom) or components built outside of a build-bom process.
  • Shared library: bitcode for the entirety of the shared library will be extracted, excluding any components of the library built outside of a build-bom process.
  • Static library: bitcode will only be available for the last element in the library. This is due to build-bom’s use of objcopy to extract the ELF sections: all llvm bitcode sections from each member of the static library will be extracted, but they will successively overwrite each other, leaving only the bitcode from the last entry in the library. This is also noted in #42.

This tool is also able to record all relevant system calls into a log. The tracing is designed to capture all of the information necessary to replay a build. It currently doesn’t capture everything (especially file move and directory operations), but will be extended as needed. Beyond system calls, it also captures the environment and working directory of each executed command.

The tool currently supports only Linux, but is designed to be modular enough to have separate tracing implementations for macOS and Windows while sharing the rest of the code.

Related Tools

There are a number of tools in the space of build interposition for the purpose of instrumentation, build modification, or bitcode generation. Most are based on acting as wrappers around standard compilers either through explicit modification of the build system or by placing themselves earlier in the PATH as aliases to real build tools.

  • Tools like wllvm and gllvm solve the problem of wrapping compiler commands to generate LLVM bitcode, but require manual modifications to the build system in order to invoke them.
  • Tools like Bear and blight provide general mechanisms for interposing on build commands by pretending to be a normal compiler earlier in your PATH. Bear additionally provides another mode based on using LD_PRELOAD to hook calls to execve.
  • Other tools record builds and replay them

These tools can be very effective, but have some issues with more complex build systems:

  • Scripts that wrap compiler commands can have difficulty getting through complex configure scripts that, e.g., do aggressive version sniffing
  • While configure script difficulties can sometimes be avoided by configuring with the real compiler and replacing or interposing the real build commands after the fact, this doesn’t always work
    • Build systems that record absolute paths at configure time are difficult to modify completely
    • Some build systems run additional configure scripts as part of the build process, which are again difficult to pass using interposition
  • Using LD_PRELOAD to hook execve can be very effective, but it is difficult to get right: some build systems rely on failed execve calls to perform PATH searches, and a hook at the call site cannot easily tell which calls succeed, because execve does not return on success
  • The LD_PRELOAD approach does not work for statically-linked compilers (so Bear has a fallback to wrapper scripts)
  • Some types of multi-stage build require that all intermediate results actually be built and be executable (e.g., if a build creates a code generator and uses it for later build stages)
  • Replaying builds based solely on compiler commands works for simple builds, but fails when build systems create and delete directories during the build (or make other interesting environmental changes) that make consistent replay very difficult

As a whole, these tools tend to require significant effort in build system understanding and modification to work on more complex codebases. The build-bom tool is designed to eliminate any need for build system modification to achieve its goals (primarily LLVM bitcode generation, but potentially arbitrary build modifications). In contrast to the other tools in this space, it monitors and interposes on the build system at the level of ptrace.

  • By working at the level of execve, it can observe when real build tools are called, no matter what names the build system thinks they actually have (e.g., if the build system itself uses build tool wrappers)
  • By working directly at the syscall level (rather than LD_PRELOAD), it works on both static and dynamically-linked build tools
  • By working at the level of execve, build-bom never needs to implement any shell lexing logic, as the shell has already lexed all of the arguments
  • By working at the ptrace level, build-bom is able to determine which calls to execve actually succeed
  • Moreover, it can delay action until after build steps succeed (since it can observe when execed processes terminate, not just when they are about to start)
  • The build-bom tool is able to maintain persistent state for an entire build without external storage, as a single process is able to view all build steps
  • Configure scripts are never a problem (at any stage of the build) because the real build always runs
  • Multi-stage builds always work because intermediate tools are built and are executable
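The point about failed execve calls can be made concrete with a small model of a PATH search, in which each candidate directory produces one execve attempt until one succeeds. This is a simplified simulation under stated assumptions, not tracing code: the `exec_attempts` helper is hypothetical and uses file checks in place of real execve calls.

```python
import os
import tempfile

def exec_attempts(command, path_dirs):
    """Return (candidate, would_succeed) for each execve a PATH search tries."""
    attempts = []
    for d in path_dirs:
        candidate = os.path.join(d, command)
        ok = os.path.isfile(candidate) and os.access(candidate, os.X_OK)
        attempts.append((candidate, ok))
        if ok:
            break  # a successful execve replaces the process; nothing follows
    return attempts

with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "first")    # empty directory: the attempt fails
    second = os.path.join(root, "second")  # holds the real tool
    os.mkdir(first)
    os.mkdir(second)
    tool = os.path.join(second, "cc")
    with open(tool, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(tool, 0o755)
    attempts = exec_attempts("cc", [first, second])
    results = [(os.path.relpath(c, root), ok) for c, ok in attempts]

print(results)
```

A hook that only sees calls as they are made observes both attempts and cannot distinguish them, whereas a tracer that also sees the outcome of each call can act on the successful one alone.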

Caveats

  • It is not possible to take advantage of parallel builds while using this tool, as all system calls in the entire build tree are serialized through a single tracing process
  • Build steps that rely on input or output redirection through pipes are very difficult to replicate, since their targets are not observable without modeling the calling process's file descriptor connection logic

Full Example

Here is a full example on a real codebase:

wget https://ftp.gnu.org/gnu/tar/tar-1.32.tar.gz
tar xf tar-1.32.tar.gz
cd tar-1.32
./configure
# Run the build under the bitcode generator
build-bom generate-bitcode -- make
# Use a suffix on LLVM tools because they are version-suffixed on Ubuntu
build-bom extract-bitcode src/tar --output=../tar.bc --llvm-link=llvm-link-9

Roadmap

  • Serious polish required
  • Build step dependency analysis for in-order replay
  • Add more thorough support for Linux system calls
    • Add a 32 bit x86 syscall table
    • Add ARM syscall tables
    • Explore automated processing of system call argument lists
  • Additional tools
    • Dependency graph analyzer and visualizer
    • A command to list all targets (or all library targets or all executable targets)
    • A command to rebuild a target binary with libfuzzer, Address Sanitizer, or Thread Sanitizer
    • Add a command to randomly test for potential missing dependencies in build systems
  • Automated granular filename tracking (to precisely model renames)
  • Fix parallel builds
  • Full handling of environment variables
  • Additional normalization policies
    • Ignore trivial dependencies like ld.so
    • Add ability to ignore dynamically loaded library dependencies
  • Easier scripting
  • macOS backend based on DTrace
  • Windows backend

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.