PROBE: Provenance for Replay OBservation Engine

This program executes and monitors another program, recording its inputs and outputs using $LD_PRELOAD.

These inputs and outputs can be joined in a provenance graph.

The provenance graph tells us where a particular file came from.

The provenance graph can help us re-execute the program, containerize the program, turn it into a workflow, or tell us which version of the data the program used.
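To make the idea of joining inputs and outputs concrete, here is a minimal sketch in Python. It does not use PROBE's real log format or API: the `RECORDS` list, the `build_graph` and `ancestors` helpers, and the `process:`/`file:` node names are all hypothetical, chosen only to illustrate how per-process reads and writes can be joined into a graph that answers where a file came from.

```python
from collections import defaultdict

# Hypothetical records: (process, files read, files written).
# PROBE's real log is richer; this only illustrates the join.
RECORDS = [
    ("gcc", ["main.c"], ["main.o"]),
    ("ld", ["main.o", "libc.a"], ["a.out"]),
    ("sort", ["data.csv"], ["sorted.csv"]),
]


def build_graph(records):
    """Map each node to the set of nodes it was directly derived from."""
    parents = defaultdict(set)
    for process, reads, writes in records:
        proc_node = f"process:{process}"
        for path in reads:
            # A process depends on every file it read.
            parents[proc_node].add(f"file:{path}")
        for path in writes:
            # A written file depends on the process that wrote it.
            parents[f"file:{path}"].add(proc_node)
    return parents


def ancestors(parents, node):
    """Return every node that `node` transitively depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(RECORDS)
# Where did a.out come from?
print(sorted(ancestors(graph, "file:a.out")))
```

In this toy example, the ancestors of `a.out` are the two processes `gcc` and `ld` plus the files `main.c`, `main.o`, and `libc.a`. The unrelated `sort` step does not appear.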

Reading list

Installing PROBE

  1. Install Nix with flakes. This can be done on any Linux (including Ubuntu, RedHat, Arch Linux, not just NixOS), MacOS X, or even Windows Subsystem for Linux.

    • If you don't already have Nix on your system, use the Determinate Systems installer.

    • If you already have Nix (but not NixOS), enable flakes by adding the following line to ~/.config/nix/nix.conf or /etc/nix/nix.conf:

      experimental-features = nix-command flakes
      
    • If you already have Nix and are running NixOS, enable flakes by adding nix.settings.experimental-features = [ "nix-command" "flakes" ]; to your configuration.

  2. Run nix env -i github:charmoniumQ/PROBE#probe-bundled.

  3. Now you should be able to run probe record [-f] [-o probe_log] <cmd...>, e.g., probe record ./script.py --foo bar.txt. See below for more details.

  4. To view the provenance, run probe dump [-i probe_log]. See below for more details.

  5. Run probe --help for more details.

What does probe record do?

The simplest invocation of the probe cli is:

probe record <CMD...>

This will run <CMD...> under the benevolent supervision of libprobe, outputting the probe record to a temporary directory. When the process exits, probe will transcribe the record directory and write a probe log file named probe_log in the current directory.

If you run this again, you'll notice it throws an error saying that the output file already exists. Solve this by passing -o <PATH> to write the log to a different file, or by passing -f to overwrite the previous log.

probe record does not pass your command through a shell. Any subshell or environment substitutions will still be performed by your own shell before the arguments are passed to probe, but probe won't understand flow control statements like if and for, shell builtins like cd, or shell aliases/functions.

If you need these you can either write a shell script and invoke probe record on that, or else run:

probe record bash -c '<SHELL_CODE>'

(Any flag after the first positional argument is treated as an argument to the command, not to probe.)
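If you drive probe from a script, the argument-ordering rules above can be sketched in Python. This is only an illustration: `build_record_argv` and `wrap_in_bash` are hypothetical helpers, not part of PROBE, and the file name `wc.probe_log` is made up. The only probe flags used are the `-f` and `-o` flags described above.

```python
import shlex


def build_record_argv(cmd, output=None, overwrite=False):
    """Build an argv list for `probe record`.

    probe's own flags must come before the command, because anything
    after the first positional argument belongs to the command.
    """
    argv = ["probe", "record"]
    if overwrite:
        argv.append("-f")
    if output is not None:
        argv += ["-o", output]
    return argv + list(cmd)


def wrap_in_bash(shell_code):
    """Wrap shell code so that builtins and flow control are interpreted by bash."""
    return ["bash", "-c", shell_code]


# `cd` and `for` need a shell, so the command is wrapped in `bash -c`.
argv = build_record_argv(
    wrap_in_bash('cd data && for f in *.csv; do wc -l "$f"; done'),
    output="wc.probe_log",
)
# Print a copy-pasteable, correctly quoted command line.
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` without `shell=True`, which mirrors how probe itself receives its arguments: already split, with no shell in between.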

If you get tired of typing probe record ... in front of every command you wish to record, consider recording your entire shell session:

$ probe record bash
bash$ ls -l
bash$ # do other commands
bash$ exit

$ probe dump
<dumps history for entire bash session> 

What can I do with provenance?

That's a huge work in progress.

We're starting out with just "analysis" of the provenance. Does this input file influence that output file in the PROBEd process? To render the process graph as an image, run:

nix shell nixpkgs#graphviz github:charmoniumQ/PROBE#probe-py-manual \
    --command sh -c 'python -m probe_py.manual.cli process-graph | tee /dev/stderr | dot -Tpng -ooutput.png /dev/stdin'
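The "does this input influence that output" question amounts to reachability in a directed graph. The sketch below assumes the graph has already been reduced to (source, target) edge pairs. The `influences` function and the `EDGES` list are hypothetical and are not part of probe_py.

```python
from collections import deque


def influences(edges, source, target):
    """Return True if `target` is reachable from `source` along directed edges."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    # Breadth-first search forward from the source node.
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical edges: file -> process that read it, process -> file it wrote.
EDGES = [
    ("input.csv", "clean.py"),
    ("clean.py", "clean.csv"),
    ("clean.csv", "plot.py"),
    ("plot.py", "figure.png"),
    ("notes.txt", "wc"),
]

print(influences(EDGES, "input.csv", "figure.png"))  # True
print(influences(EDGES, "notes.txt", "figure.png"))  # False
```

Reachability only shows that an influence is possible. A process that reads a file may still ignore its contents.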

Developing PROBE

  1. Follow step 1 of "Installing PROBE" above to install Nix.

  2. Acquire the source code: git clone https://github.com/charmoniumQ/PROBE && cd PROBE

  3. Run nix develop. This will leave you in a Nix development shell, with all the tools you need to develop and build PROBE. It is like a virtualenv, in that it is isolated from your system's pre-existing tools. In the development shell, we all have the same version of Python with all the same packages. You can exit it by typing exit.

  4. From within the development shell, type just compile. This compiles the Rust, C, and generated-Python components. If you hack on any of them, run just compile again before continuing.

  5. The manually-written Python scripts should already be added to the $PYTHONPATH. You should be able to edit them in place.

  6. Run probe <args...> or python -m probe_py.manual.cli <args...> to invoke the Rust or Python code respectively.

Prior art

  • RR-debugger which is much slower, but features more complete capturing, lets you replay but doesn't let you do any other analysis.

  • Sciunits which is much slower, more likely to crash, has less complete capturing, lets you replay but doesn't let you do other analysis.

  • Reprozip which is much slower and has less complete capturing.

  • CARE which is much slower, has less complete capturing, and lets you do containerized replay but not unprivileged native replay and not other analysis.

  • FSAtrace which is more likely to crash, has less complete capturing, and doesn't have replay or other analyses.