McSema lifts x86 and amd64 binaries to LLVM bitcode modules. McSema supports both Linux and Windows binaries, and most x86 and amd64 instructions, including integer, X87, MMX, SSE and AVX operations.
McSema is separated into two conceptual parts: control flow recovery and instruction translation. Control flow recovery is performed using the mcsema-disass tool, which uses IDA Pro to disassemble a binary file and produce a control flow graph. Instruction translation is performed using the mcsema-lift tool, which converts the control flow graph into LLVM bitcode.
- Translates 32- and 64-bit Linux ELF and Windows PE binaries to bitcode, including executables and shared libraries for each platform.
- Supports a large subset of x86 and x86-64 instructions, including most integer, X87, MMX, SSE, and AVX operations.
- Runs on both Windows and Linux, and can translate Linux binaries on Windows and Windows binaries on Linux.
- Output bitcode is compatible with the LLVM toolchain (versions 3.5 and up).
- Translated bitcode can be analyzed or recompiled as a new, working executable with functionality identical to the original.
- Has been tested on Windows 7, Windows 10, Ubuntu 14.04, and Ubuntu 16.04.
Why would anyone translate binaries back to bitcode?
- Binary Patching And Modification. Lifting to LLVM IR lets you cleanly modify the target program. You can run obfuscation or hardening passes, add features, remove features, rewrite features, or even fix that pesky typo, grammatical error, or insane logic. When done, your new creation can be recompiled to a new binary sporting all those changes. In the Cyber Grand Challenge, we were able to use McSema to translate challenge binaries to bitcode, insert memory safety checks, and then re-emit working binaries.
- Symbolic Execution with KLEE. KLEE operates on LLVM bitcode, usually generated by providing source to the LLVM toolchain. McSema can lift a binary to LLVM bitcode, permitting KLEE to operate on previously unavailable targets.
- Re-use existing LLVM-based tools. KLEE is not the only tool that becomes available for use on bitcode. It is possible to run LLVM optimization passes and other LLVM-based tools like libFuzzer on lifted bitcode.
- Analyze the binary rather than the source. Source level analysis is great but not always possible (e.g. you don't have the source) and, even when it is available, it lacks compiler transformations, re-ordering, and optimizations. Analyzing the actual binary guarantees that you're analyzing the true executed behavior.
- Write one set of analysis tools. Lifting to LLVM IR means that one set of analysis tools can work on both the source and the binary. Maintaining a single set of tools saves development time and effort, and allows for a single set of better tools.
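As a rough illustration of re-using LLVM tools on lifted bitcode, the sketch below runs LLVM's standard `-O2` pipeline with `opt` and then disassembles the result with `llvm-dis`. The function name `optimize_bitcode` and the file names are hypothetical, and it assumes `opt` and `llvm-dis` from the same LLVM version that produced the bitcode are on your `PATH`.

```shell
# Hypothetical helper: optimizes a bitcode file with LLVM's -O2 pipeline and
# writes a human-readable .ll file next to the optimized output.
# Assumes `opt` and `llvm-dis` are on PATH.
optimize_bitcode() {
  local input="$1" output="$2"
  # Refuse to run if the input bitcode file does not exist.
  [ -f "$input" ] || { echo "no such bitcode file: $input" >&2; return 1; }
  # Optimize, then disassemble the optimized bitcode for inspection.
  opt -O2 "$input" -o "$output" && llvm-dis "$output" -o "${output%.bc}.ll"
}
```

For example, `optimize_bitcode lifted.bc lifted.opt.bc` would write `lifted.opt.bc` and `lifted.opt.ll`, where `lifted.bc` stands in for whatever bitcode file you produced.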
Name | Version |
---|---|
Git | Latest |
CMake | 3.2+ |
Google Protobuf | 2.6.1 |
Google Flags | Latest |
Google Log | Latest |
Google Test | Latest |
Intel XED | Latest |
LLVM | 3.5+ |
Clang | 3.5+ (3.9 if using Visual Studio 2015) |
Python | 2.7 |
Python Package Index | Latest |
python-protobuf | 3.2.0 |
IDA Pro | 6.7+ |
Visual Studio | 2013+ (Windows Only) |
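Before installing anything, it can help to compare what is already on the machine with the minimums in the table. The following is a minimal sketch, not part of McSema: the helper names `version_ge`, `check_tool`, and `report_requirements` are made up for illustration, and the comparison relies on GNU `sort -V`, which ships with Ubuntu.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Uses GNU sort's version ordering (-V): the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints a one-line verdict for a tool given its name, the minimum version
# from the table, and the installed version (empty if not installed).
check_tool() {
  local name="$1" minimum="$2" installed="$3"
  if [ -z "$installed" ]; then
    echo "$name: missing (need $minimum+)"
  elif version_ge "$installed" "$minimum"; then
    echo "$name: $installed ok"
  else
    echo "$name: $installed too old (need $minimum+)"
  fi
}

# Checks CMake and Clang, pulling the first dotted version number out of
# each tool's --version banner.
report_requirements() {
  check_tool cmake 3.2 "$(cmake --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
  check_tool clang 3.5 "$(clang --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)"
}
```

Running `report_requirements` prints one line per tool. Only CMake and Clang are covered here; the other rows of the table would need their own checks.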
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install \
git \
cmake \
python2.7 python-pip \
build-essential \
realpath
sudo pip install --upgrade pip
sudo pip install 'protobuf==3.2.0'
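Because the protobuf package is pinned to a specific version, a quick check after installation can catch a mismatched install. This is a sketch only: the helper names are hypothetical, and it assumes the `pip` on your `PATH` is the one that belongs to Python 2.7.

```shell
# Extracts the value of the "Version:" field from `pip show` output on stdin.
pip_show_version() {
  awk '/^Version:/ { print $2 }'
}

# Succeeds only when the installed protobuf package matches the expected
# version (defaults to the 3.2.0 pin used above).
protobuf_is_pinned() {
  local expected="${1:-3.2.0}"
  local actual
  actual="$(pip show protobuf 2>/dev/null | pip_show_version)"
  [ "$actual" = "$expected" ]
}
```

A typical use would be `protobuf_is_pinned || echo "protobuf is not at 3.2.0"`.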
Note: If you are using IDA on 64-bit Ubuntu and your IDA install does not use the system Python, you can add the protobuf library manually to IDA's zip of modules.
# Python module dir is generally in /usr/lib or /usr/local/lib
touch /path/to/lib/python2.7/dist-packages/google/__init__.py
cd /path/to/lib/python2.7/dist-packages/
sudo zip -rv /path/to/ida-6.X/python/lib/python27.zip google/
sudo chown your_user:your_user /path/to/ida-6.X/python/lib/python27.zip
Users wishing to run McSema on Ubuntu 14.04 should upgrade their version of CMake.
sudo add-apt-repository -y ppa:george-edison55/cmake-3.x
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install cmake
The next step is to clone the Remill repository. We then clone the McSema repository into the tools subdirectory of Remill. This mirrors how Clang and LLVM are distributed separately, with the Clang source code placed in LLVM's tools directory.
git clone --depth 1 https://github.com/trailofbits/remill.git
pushd remill/tools
git clone --depth 1 https://github.com/trailofbits/mcsema.git
popd
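Since the build depends on McSema living inside Remill's tools directory, a small sanity check can confirm the layout before moving on. The function name `mcsema_layout_ok` is hypothetical; it only tests that the directory exists.

```shell
# Succeeds when McSema has been cloned into the tools subdirectory of the
# Remill checkout rooted at $1 (defaults to ./remill).
mcsema_layout_ok() {
  local remill_root="${1:-remill}"
  [ -d "$remill_root/tools/mcsema" ]
}
```

Run it from the directory that contains the `remill` checkout, for example `mcsema_layout_ok || echo "mcsema is not under remill/tools"`.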
Several Trail of Bits projects depend on a common subset of C and C++ codebases. We've organized these dependencies into the cxx-common repository.
git clone --depth 1 https://github.com/trailofbits/cxx-common.git
mkdir remill-build
cd remill-build
../cxx-common/build.sh --template everything `pwd`/libraries
Note: This will build all of the dependencies needed by Remill and McSema. This includes LLVM, Clang, Intel XED, Google Protocol Buffers, Google Log, Google Flags, and Google Test.
If you are using Ubuntu 16.04 and want to skip this step, then download and extract libraries from one of the following URLs. McSema works across several versions of LLVM. This makes integrating into third-party projects using older LLVM versions (e.g. KLEE) easier.
OS | LLVM | Download URL |
---|---|---|
Ubuntu 16.04 | 4.0 | https://s3.amazonaws.com/cxx-common/libraries-llvm40-ubuntu160402.tar.gz |
Ubuntu 16.04 | 3.9 | https://s3.amazonaws.com/cxx-common/libraries-llvm39-ubuntu160402.tar.gz |
Ubuntu 16.04 | 3.8 | https://s3.amazonaws.com/cxx-common/libraries-llvm38-ubuntu160402.tar.gz |
Ubuntu 16.04 | 3.7 | https://s3.amazonaws.com/cxx-common/libraries-llvm37-ubuntu160402.tar.gz |
Ubuntu 16.04 | 3.6 | https://s3.amazonaws.com/cxx-common/libraries-llvm36-ubuntu160402.tar.gz |
Ubuntu 16.04 | 3.5 | https://s3.amazonaws.com/cxx-common/libraries-llvm35-ubuntu160402.tar.gz |
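Fetching one of these archives can be sketched as below. The helper names are hypothetical, the URL pattern is inferred from the table rows rather than documented anywhere, and the sketch assumes the archive unpacks into the current directory, so run it from inside `remill-build`.

```shell
# Builds the download URL for an LLVM version written without the dot
# (e.g. 39 for LLVM 3.9). The naming pattern is inferred from the table.
libraries_url() {
  local llvm_version="$1"
  echo "https://s3.amazonaws.com/cxx-common/libraries-llvm${llvm_version}-ubuntu160402.tar.gz"
}

# Downloads the archive for the given LLVM version and unpacks it in the
# current directory. curl -f fails on HTTP errors; -L follows redirects.
fetch_libraries() {
  local url
  url="$(libraries_url "$1")"
  curl -fL "$url" -o libraries.tar.gz && tar -xzf libraries.tar.gz
}
```

For the LLVM 3.9 build, that would be `fetch_libraries 39`.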
Note: This will build McSema using the LLVM 3.9 toolchain. If you want to use McSema with other versions of the LLVM toolchain, then manually specify targets to the cxx-common/build.sh script. For example, to build LLVM and Clang 3.5, do the following:
The next step is to build the code. McSema (and Remill) must be built using the Clang compiler.
export TRAILOFBITS_LIBRARIES=`pwd`/libraries/
cmake ../remill
make -j4
Note: If you are using a custom version of LLVM, then specify the following at the command line before running cmake:
export LLVM_INSTALL_PREFIX=/path/to/llvm/install/dir
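The build relies on `TRAILOFBITS_LIBRARIES` pointing at a real directory, so a guard around the configure and build steps can fail early with a clear message. This is a sketch under assumptions: the function names are hypothetical, it is meant to be run from `remill-build`, and it uses `nproc` to pick the job count instead of the fixed `-j4` above.

```shell
# Succeeds when TRAILOFBITS_LIBRARIES is set and names an existing directory.
libraries_env_ok() {
  [ -n "${TRAILOFBITS_LIBRARIES:-}" ] && [ -d "$TRAILOFBITS_LIBRARIES" ]
}

# Configures and builds Remill (and McSema under its tools directory),
# refusing to start if the libraries directory is not set up.
build_remill() {
  libraries_env_ok || {
    echo "TRAILOFBITS_LIBRARIES is unset or not a directory" >&2
    return 1
  }
  cmake ../remill && make -j"$(nproc)"
}
```

After exporting `TRAILOFBITS_LIBRARIES` (and `LLVM_INSTALL_PREFIX` if you use a custom LLVM), call `build_remill`.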
- Common Errors and Debugging Tips
- How to implement the semantics of an instruction
- How to use mcsema: A walkthrough
- Life of an instruction
- Limitations
- Navigating the source code
- Using Mcsema with libFuzzer
If you are experiencing problems with McSema or just want to learn more and contribute, join the #binary-lifting channel of the Empire Hacking Slack. Alternatively, you can join our mailing list at mcsema-dev@googlegroups.com or email us privately at mcsema@trailofbits.com.
McSema is pronounced 'em see se ma' and is short for machine code semantics.
McSema's goal is binary to bitcode translation. Accurate disassembly and control flow recovery is a separate and difficult problem. IDA has already invested countless man-hours into getting disassembly right, and it only makes sense that we re-use existing work. We understand that not everyone can afford an IDA license. With the original release of McSema, we shipped our own recursive descent disassembler. It was never as good as IDA and it never would be. Maintaining the broken tool took away valuable development time from more important McSema work. We hope to eventually transition to more accessible control flow recovery frontends, such as Binary Ninja (we have a branch with initial Binary Ninja support). We very warmly welcome pull requests that implement new control flow recovery frontends.
We would love to take you on as an intern to help improve McSema. We have several project ideas labelled intern_project in the issues tracker. You are not limited to those items: if you think of a great feature you want in McSema, let us know and we will sponsor it. Simply contact us on our Slack channel or via mcsema@trailofbits.com and let us know what you want to work on and why.