/LLTFI

LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applications at the LLVM IR level. Please refer to the paper below. NOTE: If you publish a paper using LLTFI, please cite the following paper:

Primary LanguageC++Apache License 2.0Apache-2.0

LLTFI

LLTFI (Low-Level Tensor Fault Injector) is a unified SWiFI (Software-implemented fault injection) tool that supports fault injection of both C/C++ programs and ML applications written using high-level frameworks such as TensorFlow and PyTorch.

As machine learning (ML) has become more prevalent across many critical domains, so has the need to understand ML system resilience. While there are many ML fault injectors at the application level, there has been little work enabling fault injection of ML applications at a lower level. LLTFI is a tool that allows users to run fault injection experiments on C/C++, TensorFlow and PyTorch applications at a lower level (at the LLVM IR level). Please refer to the following paper for more information about LLTFI.

LLTFI is built on top of LLFI and is fully backward compatible with it.

LLFI

LLFI is an LLVM-based fault injection tool, that injects faults into the LLVM IR of the application source code. The faults can be injected into specific program points, and the effect can be easily tracked back to the source code. LLFI is typically used to map fault characteristics back to source code and hence understand source level or program characteristics for various kinds of fault outcomes. Detailed documentation about LLFI can be found at: https://github.com/DependableSystemsLab/LLFI/wiki. Because LLTFI is designed to be backward compatible with LLFI, the basic setup instructions for LLTFI are similar to those of LLFI. However, there are additional steps and dependencies for running ML programs.

LLTFI Workflow:

High-level ML models need to be lowered to intermediate representation (IR) for fault injection. LLTFI provides a single script that converts ML models into LLVM IR, using several publicly available tools and performs fault injection. LLTFI first lowers ML models to MLIR (Multi-Level Intermediate Representation) using ONNX-MLIR before converting to LLVM IR. The reasons for choosing MLIR are MLIR's ability to better preserve the semantics of ML models, its integration with LLVM, testability and easier extensibility.

Workflow Diagram of LLTFI:

Alt text

  • LLTFI first converts all ML models to the ONNX format. ONNX’s open exchange format allows LLTFI to support both TensorFlow and PyTorch.
  • Then, the ONNX file is converted into MLIR through ONNX-MLIR.
  • Finally, we convert MLIR into LLVM IR, using the mlir-translate tool in LLVM 15.0.

LLTFI can now inject faults into the LLVM IR, alike lowered C/C++ programs.

The LLFI tool was originally written for LLVM 3.4. While developing LLTFI, the entire LLFI tool was upgraded to LLVM 15.0 because LLVM 3.4 has no support for MLIR. This upgrade also ensured that LLTFI is compatible all of the newest C/C++ features, and LLVM optimization passes

Auto-Installer

If you wish to build LLTFI and its dependencies via the auto-installer(installer/InstallLLTFI.py), you do not need to clone the LLTFI git repository. Simply download the installer script by itself, and it will fetch the latest version of the git repository for you. To run the script, simply copy it into the directory where you would like to build the LLTFI and, from the command line, run python3 InstallLLTFI.py.

Dependencies:

  1. 64 Bit Machine (preferably with GPU for faster training of ML programs)
  2. 64 bit Linux (Ubuntu 20.04) or OS X
  3. CMake (minimum v3.15)
  4. Python 3 and above
  5. Ninja >= 1.10.2
  6. Internet Connection

Usage:

  1. Copy the InstallLLTFI.py script to where you want to build the LLTFI. Run "python3 InstallLLTFI.py -h" to see all running options/guidelines
  2. Run "python3 InstallLLTFI.py"

Manual Installation

In this method, the developer has more control over the location of the LLVM build that the LLTFI requires. If you already have LLVM built, you could use that build.

Dependencies:

  1. 64 Bit Machine (preferably with GPU for faster training of ML programs)
  2. 64 bit Linux (Ubuntu 20.04) or OS X
  3. CMake (minimum v3.15)
  4. Python 3 and above
  5. Python YAML library (PyYAML v5.4.1)
  6. Ninja >= 1.10.2
  7. libprotoc >= 3.11.0
  8. Clang v15.0 (commit: 9778ec057cf4)
  9. LLVM v15.0 (commit: 9778ec057cf4) (Reference). LLVM 15.0 takes a long time to completely build. Following is a shortcut to checking out the required LLVM commit, and building only the necessary LLVM targets.
      git clone https://github.com/llvm/llvm-project.git
      
      # Check out a specific branch that is known to work with the required version of ONNX MLIR.
      cd llvm-project && git checkout 9778ec057cf4 && cd ..
      	
      mkdir llvm-project/build
      cd llvm-project/build
      
      cmake -G Ninja ../llvm \
      	-DLLVM_ENABLE_PROJECTS="clang;mlir" \
      	-DLLVM_BUILD_TESTS=ON \
      	-DLLVM_TARGETS_TO_BUILD="host" \
      	-DLLVM_ENABLE_ASSERTIONS=ON \
      	-DLLVM_ENABLE_RTTI=ON
    
      cmake --build . --target clang check-mlir mlir-translate opt llc lli llvm-dis llvm-link -j 2
    
      ninja install -j 2
    
  10. For executing ML programs, following additional dependencies have to be installed:
    1. TensorFlow framework (v2.0 or greater)

    2. numpy package (part of TensorFlow)

    3. tensorflow-onnx: Installation with pip is sufficient

      pip install tf2onnx
      
    4. libprotoc

      curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.17.2/protobuf-all-3.17.2.zip
      unzip protobuf-all-3.17.2.zip
      cd protobuf-3.17.2
      
      ./configure
      make -j 2
      make check
      sudo make install
      sudo ldconfig # refresh shared library cache.
      
    5. ONNX-MLIR

      Additional changes made in the ONNX-MLIR code are present in: https://github.com/DependableSystemsLab/onnx-mlir-lltfi. Clone this repo and checkout the LLTFI branch. The MLIR_DIR cmake variable must be set before building onnx-mlir. It should point to the mlir cmake module inside an llvm-project build or install directory (e.g., llvm-project/build/lib/cmake/mlir).

      MLIR_DIR=$(pwd)/llvm-project/build/lib/cmake/mlir
      

      Onnx-mlir branch LLTFI has to be built and installed.

      git clone --recursive https://github.com/DependableSystemsLab/onnx-mlir-lltfi.git
      mv onnx-mlir-lltfi onnx-mlir && cd onnx-mlir
      git checkout LLTFI
      cd ..
      
      mkdir onnx-mlir/build && cd onnx-mlir/build
      cmake -G Ninja \
      	-DCMAKE_CXX_COMPILER=/usr/bin/c++ \
      	-DMLIR_DIR=${MLIR_DIR} \
      	.. 
      	
      cmake --build .
      	
      # Run lit tests:
      export LIT_OPTS=-v
      cmake --build . --target check-onnx-lit
      
      ninja install
      
  11. GraphViz package (for visualizing error propagation)

Building LLTFI:

Run ./setup --help for build instructions.

  $ ./setup --help

  Usage: setup OPTIONS
  List of options:
  -LLVM_DST_ROOT <LLVM CMake build root dir>:
      Make sure you build LLVM with CMake and pass build root directory here
  -LLVM_SRC_ROOT <LLVM source root dir>
  -LLFI_BUILD_ROOT <path where you want to build LLFI>
  -LLVM_GXX_BIN_DIR <llvm-gcc/g++'s parent dir> (optional):
      You don't need to set it if it is in system path
  -JAVA_HOME_DIR <java home directory for oracle jdk 7 or higher> (optional):
    You don't need to set it if your default jdk is oracle jdk 7 or higher and in system path


  --help(-h): show help information
  --runTests: Add this option if you want to run all regression tests after building LLFI.

Below is the command to build LLTFI(without GUI) if clang is already in $PATH:

./setup -LLFI_BUILD_ROOT $BUILD/LLFI -LLVM_SRC_ROOT $SRC/llvm-15.0 -LLVM_DST_ROOT $BUILD/llvm-15.0

Details about running the Web GUI for LLTFI can be found here

Building LLTFI using Docker:

docker/Dockerfile can be used to build and run LLTFI in a docker container. You can modify the Dockerfile according to your system and project requirements. More details can be found here

Steps to build:

  1. Creating a docker image from the Dockerfile: Copy the Dockerfile to a directory of your choice outside this repository. To create an image, run the command docker build --tag imageName . in the terminal.
  2. Starting a docker container: Once the above step is completed, a docker container can be started using the command docker run -it imageName

Running tests:

Running all regression tests after installation is highly recommended. Note that you may encounter some error messages during the fault injection stage. This is normal. Once all tests have been completed and they all passed, LLFI is correctly installed.

For complete test of whole of LLFI, please use LLFI test suite and refer to the wiki page: Test suite for regression test for details.

Running Sample Programs

You can use test programs in the directory sample_programs/ or test_suite/PROGRAMS/ to test LLFI. Programs in the sample_programs directory already contain a valid input.yaml file.

Example program: factorial:

  1. Copy the sample_programs/factorial/ directory to your project directory.
  2. Set LLFI_BUILD_ROOT environment variable e.g., export LLFI_BUILD_ROOT=/path/to/LLFI/installation
  3. Call the ./compileAndRun.sh script with the first argument as factorial, and the second argument as the number to compute the factorial of (e.g., 6)

Results

After fault injection, output from LLFI and the tested application can be found in the llfi directory.

Directory Contents
std_output Piped STDOUT from the tested application
llfi_stat_output Fault injection statistics
error_output Failure reports (program crashes, hangs, etc.)
trace_report_output Faults propagation report files and graph

Reproducing the experiments in our ISSRE'23 paper

Please refer to the following README file for instructions on obtaining benchmarks and reproducing the experiments in our ISSRE'23 paper. ISSRE'23 AE

References

  • LLFI Paper
  • LLFI Wiki
  • Udit Kumar Agarwal, Abraham Chan, Karthik Pattabiraman. LLTFI: Framework agnostic fault injection for machine learning applications (Tools and Artifacts Track). International Symposium on Software Reliability Engineering (ISSRE), 2022. 10 pages. LLTFI Paper
  • Udit Kumar Agarwal, Abraham Chan, Karthik Pattabiraman. Resilience Assessment of Large Language Models under Transient Hardware Faults (PER). International Symposium on Software Reliability Engineering (ISSRE), 2023. Paper

Citations

@article{Agarwal22LLTFI,
  title={LLTFI: Framework agnostic fault injection for machine learning applications (Tools and Artifacts Track)},
  author={Agarwal, Udit and Chan, Abraham and Pattabiraman, Karthik},
  journal={International Symposium on Software Reliability Engineering (ISSRE)},
  year={2022},
  publisher={IEEE}
}

====== Read caveats.txt for caveats and known problems.