LLTFI

LLTFI (Low-Level Tensor Fault Injector) is a unified SWiFI (Software-implemented fault injection) tool that supports fault injection of both C/C++ programs and ML applications written using high-level frameworks such as TensorFlow and PyTorch.

As machine learning (ML) has become more prevalent across many critical domains, so has the need to understand ML system resilience. While there are many ML fault injectors at the application level, there has been little work enabling fault injection of ML applications at a lower level. LLTFI is a tool that allows users to run fault injection experiments on C/C++, TensorFlow and PyTorch applications at a lower level (at the LLVM IR level). Please refer to the following paper for more information about LLTFI.

LLTFI is built on top of LLFI and is fully backward compatible with it.

LLFI

LLFI is an LLVM-based fault injection tool, that injects faults into the LLVM IR of the application source code. The faults can be injected into specific program points, and the effect can be easily tracked back to the source code. LLFI is typically used to map fault characteristics back to source code and hence understand source level or program characteristics for various kinds of fault outcomes. Detailed documentation about LLFI can be found at: https://github.com/DependableSystemsLab/LLFI/wiki. Because LLTFI is designed to be backward compatible with LLFI, the basic setup instructions for LLTFI are similar to those of LLFI. However, there are additional steps and dependencies for running ML programs.

LLTFI Workflow:

High-level ML models need to be lowered to intermediate representation (IR) for fault injection. LLTFI provides a single script that converts ML models into LLVM IR, using several publicly available tools and performs fault injection. LLTFI first lowers ML models to MLIR (Multi-Level Intermediate Representation) using ONNX-MLIR before converting to LLVM IR. The reasons for choosing MLIR are MLIR's ability to better preserve the semantics of ML models, its integration with LLVM, testability and easier extensibility.

Workflow Diagram of LLTFI:

LLTFI first converts all ML models to the ONNX format. ONNX’s open exchange format allows LLTFI to support both TensorFlow and PyTorch.
Then, the ONNX file is converted into MLIR through ONNX-MLIR.
Finally, we convert MLIR into LLVM IR, using the mlir-translate tool in LLVM 15.0.

LLTFI can now inject faults into the LLVM IR, alike lowered C/C++ programs.

The LLFI tool was originally written for LLVM 3.4. While developing LLTFI, the entire LLFI tool was upgraded to LLVM 15.0 because LLVM 3.4 has no support for MLIR. This upgrade also ensured that LLTFI is compatible all of the newest C/C++ features, and LLVM optimization passes

Auto-Installer

If you wish to build LLTFI and its dependencies via the auto-installer(installer/InstallLLTFI.py), you do not need to clone the LLTFI git repository. Simply download the installer script by itself, and it will fetch the latest version of the git repository for you. To run the script, simply copy it into the directory where you would like to build the LLTFI and, from the command line, run python3 InstallLLTFI.py.

Dependencies:

64 Bit Machine (preferably with GPU for faster training of ML programs)
64 bit Linux (Ubuntu 20.04) or OS X
CMake (minimum v3.15)
Python 3 and above
Ninja >= 1.10.2
Internet Connection

Usage:

Copy the InstallLLTFI.py script to where you want to build the LLTFI. Run "python3 InstallLLTFI.py -h" to see all running options/guidelines
Run "python3 InstallLLTFI.py"

Manual Installation

In this method, the developer has more control over the location of the LLVM build that the LLTFI requires. If you already have LLVM built, you could use that build.

Dependencies:

64 Bit Machine (preferably with GPU for faster training of ML programs)
64 bit Linux (Ubuntu 20.04) or OS X
CMake (minimum v3.15)
Python 3 and above
Python YAML library (PyYAML v5.4.1)
Ninja >= 1.10.2
libprotoc >= 3.11.0
Clang v15.0 (commit: 9778ec057cf4)

LLVM v15.0 (commit: 9778ec057cf4) (Reference). LLVM 15.0 takes a long time to completely build. Following is a shortcut to checking out the required LLVM commit, and building only the necessary LLVM targets.

  git clone https://github.com/llvm/llvm-project.git
  
  # Check out a specific branch that is known to work with the required version of ONNX MLIR.
  cd llvm-project && git checkout 9778ec057cf4 && cd ..
  	
  mkdir llvm-project/build
  cd llvm-project/build
  
  cmake -G Ninja ../llvm \
  	-DLLVM_ENABLE_PROJECTS="clang;mlir" \
  	-DLLVM_BUILD_TESTS=ON \
  	-DLLVM_TARGETS_TO_BUILD="host" \
  	-DLLVM_ENABLE_ASSERTIONS=ON \
  	-DLLVM_ENABLE_RTTI=ON

  cmake --build . --target clang check-mlir mlir-translate opt llc lli llvm-dis llvm-link -j 2

  ninja install -j 2

For executing ML programs, following additional dependencies have to be installed:

TensorFlow framework (v2.0 or greater)
numpy package (part of TensorFlow)
tensorflow-onnx: Installation with pip is sufficient
```
pip install tf2onnx
```

libprotoc

curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.17.2/protobuf-all-3.17.2.zip
unzip protobuf-all-3.17.2.zip
cd protobuf-3.17.2

./configure
make -j 2
make check
sudo make install
sudo ldconfig # refresh shared library cache.

ONNX-MLIR

Additional changes made in the ONNX-MLIR code are present in: https://github.com/DependableSystemsLab/onnx-mlir-lltfi. Clone this repo and checkout the LLTFI branch. The MLIR_DIR cmake variable must be set before building onnx-mlir. It should point to the mlir cmake module inside an llvm-project build or install directory (e.g., llvm-project/build/lib/cmake/mlir).

MLIR_DIR=$(pwd)/llvm-project/build/lib/cmake/mlir

Onnx-mlir branch LLTFI has to be built and installed.

git clone --recursive https://github.com/DependableSystemsLab/onnx-mlir-lltfi.git
mv onnx-mlir-lltfi onnx-mlir && cd onnx-mlir
git checkout LLTFI
cd ..

mkdir onnx-mlir/build && cd onnx-mlir/build
cmake -G Ninja \
	-DCMAKE_CXX_COMPILER=/usr/bin/c++ \
	-DMLIR_DIR=${MLIR_DIR} \
	.. 
	
cmake --build .
	
# Run lit tests:
export LIT_OPTS=-v
cmake --build . --target check-onnx-lit

ninja install

GraphViz package (for visualizing error propagation)

Building LLTFI:

Run ./setup --help for build instructions.

  $ ./setup --help

  Usage: setup OPTIONS
  List of options:
  -LLVM_DST_ROOT <LLVM CMake build root dir>:
      Make sure you build LLVM with CMake and pass build root directory here
  -LLVM_SRC_ROOT <LLVM source root dir>
  -LLFI_BUILD_ROOT <path where you want to build LLFI>
  -LLVM_GXX_BIN_DIR <llvm-gcc/g++'s parent dir> (optional):
      You don't need to set it if it is in system path
  -JAVA_HOME_DIR <java home directory for oracle jdk 7 or higher> (optional):
    You don't need to set it if your default jdk is oracle jdk 7 or higher and in system path


  --help(-h): show help information
  --runTests: Add this option if you want to run all regression tests after building LLFI.

Below is the command to build LLTFI(without GUI) if clang is already in $PATH:

./setup -LLFI_BUILD_ROOT $BUILD/LLFI -LLVM_SRC_ROOT $SRC/llvm-15.0 -LLVM_DST_ROOT $BUILD/llvm-15.0

Details about running the Web GUI for LLTFI can be found here

Building LLTFI using Docker:

docker/Dockerfile can be used to build and run LLTFI in a docker container. You can modify the Dockerfile according to your system and project requirements. More details can be found here

Steps to build:

Creating a docker image from the Dockerfile: Copy the Dockerfile to a directory of your choice outside this repository. To create an image, run the command docker build --tag imageName . in the terminal.
Starting a docker container: Once the above step is completed, a docker container can be started using the command docker run -it imageName

Running tests:

Running all regression tests after installation is highly recommended. Note that you may encounter some error messages during the fault injection stage. This is normal. Once all tests have been completed and they all passed, LLFI is correctly installed.

For complete test of whole of LLFI, please use LLFI test suite and refer to the wiki page: Test suite for regression test for details.

Running Sample Programs

You can use test programs in the directory sample_programs/ or test_suite/PROGRAMS/ to test LLFI. Programs in the sample_programs directory already contain a valid input.yaml file.

Example program: factorial:

Copy the sample_programs/factorial/ directory to your project directory.
Set LLFI_BUILD_ROOT environment variable e.g., export LLFI_BUILD_ROOT=/path/to/LLFI/installation
Call the ./compileAndRun.sh script with the first argument as factorial, and the second argument as the number to compute the factorial of (e.g., 6)

Results

After fault injection, output from LLFI and the tested application can be found in the llfi directory.

Directory	Contents
std_output	Piped STDOUT from the tested application
llfi_stat_output	Fault injection statistics
error_output	Failure reports (program crashes, hangs, etc.)
trace_report_output	Faults propagation report files and graph

Reproducing the experiments in our ISSRE'23 paper

Please refer to the following README file for instructions on obtaining benchmarks and reproducing the experiments in our ISSRE'23 paper. ISSRE'23 AE

References

LLFI Paper
LLFI Wiki
Udit Kumar Agarwal, Abraham Chan, Karthik Pattabiraman. LLTFI: Framework agnostic fault injection for machine learning applications (Tools and Artifacts Track). International Symposium on Software Reliability Engineering (ISSRE), 2022. 10 pages. LLTFI Paper
Udit Kumar Agarwal, Abraham Chan, Karthik Pattabiraman. Resilience Assessment of Large Language Models under Transient Hardware Faults (PER). International Symposium on Software Reliability Engineering (ISSRE), 2023. Paper

Citations

@article{Agarwal22LLTFI,
  title={LLTFI: Framework agnostic fault injection for machine learning applications (Tools and Artifacts Track)},
  author={Agarwal, Udit and Chan, Abraham and Pattabiraman, Karthik},
  journal={International Symposium on Software Reliability Engineering (ISSRE)},
  year={2022},
  publisher={IEEE}
}

====== Read caveats.txt for caveats and known problems.

DependableSystemsLab/LLTFI