PROBE: Provenance for Replay OBservation Engine

This program executes and monitors another program, recording its inputs and outputs using $LD_PRELOAD.

These inputs and outputs can be joined in a provenance graph.

The provenance graph tells us where a particular file came from.

The provenance graph can help us re-execute the program, containerize the program, turn it into a workflow, or tell us which version of the data this program used.
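
To make the idea concrete, here is a minimal sketch of a provenance graph in Python. The record format (a process name, the files it read, the files it wrote), the sample file names, and the functions `build_graph` and `lineage` are all assumptions made for illustration; they are not PROBE's actual data model or API.

```python
from collections import defaultdict

# Hypothetical records: (process, files it read, files it wrote).
RECORDS = [
    ("preprocess.py", ["raw.csv"], ["clean.csv"]),
    ("train.py", ["clean.csv", "config.yaml"], ["model.pkl"]),
    ("plot.py", ["model.pkl"], ["figure.png"]),
]


def build_graph(records):
    """Map each node (file or process) to the nodes it directly depends on."""
    parents = defaultdict(set)
    for process, reads, writes in records:
        for path in reads:
            parents[process].add(path)   # the process consumed this file
        for path in writes:
            parents[path].add(process)   # this file was produced by the process
    return parents


def lineage(parents, target):
    """Return every file and process that `target` transitively depends on."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    graph = build_graph(RECORDS)
    # Answers "where did figure.png come from?"
    print(sorted(lineage(graph, "figure.png")))
```

Walking the graph backwards from a file answers "where did this come from?"; walking it forwards from the inputs would give an order in which the processes could be re-executed.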

Installing PROBE

Install Nix with flakes. This can be done on any Linux (including Ubuntu, RedHat, Arch Linux, not just NixOS), MacOS X, or even Windows Subsystem for Linux.
- If you don't already have Nix on your system, use the Determinate Systems installer.
- If you already have Nix (but not NixOS), enable flakes by adding the following line to ~/.config/nix/nix.conf or /etc/nix/nix.conf:
```
experimental-features = nix-command flakes
```
- If you already have Nix and are running NixOS, enable flakes by adding nix.settings.experimental-features = [ "nix-command" "flakes" ]; to your configuration.
If you want to avoid a time-consuming build, add our public cache.
```
nix profile install --accept-flake-config nixpkgs#cachix
cachix use charmonium
```
If you want to build from source (e.g., for security reasons), skip this step.
Run nix profile install github:charmoniumQ/PROBE#probe-bundled.
Now you should be able to run probe record [-f] [-o probe_log] <cmd...>, e.g., probe record ./script.py --foo bar.txt. See below for more details.
To view the provenance, run probe dump [-i probe_log]. See below for more details.
Run probe --help for more details.

What does `probe record` do?

The simplest invocation of the probe cli is:

probe record <CMD...>

This will run <CMD...> under the benevolent supervision of libprobe, outputting the probe record to a temporary directory. When the process exits, probe will transcribe the record directory and write a probe log file named probe_log in the current directory.

If you run this again, it will fail with an error saying that the output file already exists. To fix this, pass -o <PATH> to write the log to a different file, or pass -f to overwrite the previous log.

probe record does not pass your command through a shell. Any subshell or environment substitutions will still be performed by your shell before the arguments are passed to probe, but probe won't understand flow-control statements like if and for, shell builtins like cd, or shell aliases/functions.

If you need these, you can either write a shell script and invoke probe record on that, or run:

probe record bash -c '<SHELL_CODE>'
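
The difference between executing an argument vector directly and handing code to a shell can be illustrated with Python's `subprocess` module. This is only an analogy, not PROBE's implementation; the helper name `run_argv` is hypothetical, and the example assumes a POSIX `sh` is available on the `PATH` (it stands in for `bash` here).

```python
import subprocess
import sys

# A tiny child program that prints the arguments it received, one per line.
SHOW_ARGS = [sys.executable, "-c", "import sys; print('\\n'.join(sys.argv[1:]))"]


def run_argv(argv):
    """Execute argv directly (no shell involved) and return its stdout lines."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Without a shell, '&&' and '$HOME' are literal strings handed to the child.
    print(run_argv(SHOW_ARGS + ["a", "&&", "echo", "$HOME"]))

    # Wrapping the code in `sh -c` lets a shell interpret '&&'.
    print(run_argv(["sh", "-c", "echo first && echo second"]))
```

In the first call the child sees `&&` and `$HOME` as ordinary arguments; in the second, the shell started by `sh -c` interprets the operator and runs both commands.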

Any flag after the first positional argument is treated as an argument to the command, not probe.
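
The rule above can be sketched as a small Python function. This is an illustration of the stated convention, not PROBE's actual argument parser; the function name `split_probe_args` is hypothetical, and only the -f and -o flags mentioned in this document are modeled.

```python
def split_probe_args(args):
    """Split args into (probe_flags, command) at the first positional argument.

    Assumption: -f takes no value and -o takes exactly one value.
    """
    flags = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-f":
            flags.append(arg)
            i += 1
        elif arg == "-o":
            flags.extend(args[i:i + 2])  # keep -o together with its value
            i += 2
        else:
            break  # first positional argument: the command starts here
    return flags, args[i:]


if __name__ == "__main__":
    # The first -f is treated as a probe flag; the second belongs to ls.
    print(split_probe_args(["-f", "-o", "log", "ls", "-f", "-l"]))
```

Under this reading, in probe record -f ls -f the first -f would be consumed by probe and the second would be passed through to ls.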

probe record creates a file called probe_log. If that file already exists from a previous recording, pass -f to probe record to overwrite it.

If you get tired of typing probe record ... in front of every command you wish to record, consider recording your entire shell session:

$ probe record bash
bash$ ls -l
bash$ # do other commands
bash$ exit

$ probe dump
<dumps history for entire bash session>

What can I do with provenance?

That's a huge work in progress.

Try exporting to different formats.

probe export --help

Developing PROBE

Follow the installation steps above to install Nix.
Acquire the source code: git clone https://github.com/charmoniumQ/PROBE && cd PROBE
Run nix develop. This will leave you in a Nix development shell with all the tools you need to develop and build PROBE. It is like a virtualenv, in that it is isolated from your system's pre-existing tools. In the development shell, we all have the same version of Python with all the same packages. You can exit it by typing exit.
From within the development shell, type just compile. This compiles the Rust, C, and generated-Python components. If you change any of these, run just compile again before continuing.
The manually-written Python scripts should already be added to the $PYTHONPATH. You should be able to edit them in place.
Run probe <args...> or python -m probe_py.manual.cli <args...> to invoke the Rust or Python code respectively.
Before submitting a PR, run just pre-commit, which runs the pre-commit checks.

Research reading list

Provenance for Computational Tasks: A Survey by Freire, et al. in CiSE 2008 for an overview of provenance in general.
Transparent Result Caching by Vahdat and Anderson in USENIX ATC 1998 for an early system-level provenance tracer in Solaris using the /proc fs. Linux's /proc fs doesn't have the same functionality. However, this paper discusses two interesting application of provenance: unmake (query lineage information) and transparent Make (more generally, incremental computation).
CDE: Using System Call Interposition to Automatically Create Portable Software Packages by Guo and Engler in USENIX ATC 2011 for an early system-level provenance tracer. Their only application is software execution replay, but replay is quite an important application.
Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness? by Thain, Meng, and Ivie in 2015 discusses whether enabling automatic-replay is actually a good idea. A cursory glance makes PROBE seem more like "preserving the mess", but I think, with some care in the design choices, it actually can be more like "encouraging cleanliness", for example, by having heuristics that help cull/simplify provenance and generating human readable/editable package-manager recipes.
SoK: History is a Vast Early Warning System: Auditing the Provenance of System Intrusions by Inam et al. in IEEE Symposium on Security and Privacy 2023; see specifically Inam's survey of different possibilities for the "Capture layer", "Reduction layer", and "Infrastructure layer". Although provenance-for-security has different constraints than provenance for other purposes, the taxonomy that Inam lays out is still useful. PROBE operates by intercepting libc calls, which is essentially a "middleware" in Table I (platform modification, no program modification, no config change, incomplete mediation, not tamperproof, inter-process tracing, etc.).
System-Level Provenance Tracers by me et al. in ACM REP 2023 for a motivation of this work. It surveys prior work, identifies potential gaps, and explains why I think library interposition is a promising path for future research.
Computational Experiment Comprehension using Provenance Summarization by Bufford et al. in ACM REP 2023 discusses how to implement an interface for querying provenance information. They compare classical graph-based visualization with an interactive LLM in a user study.

Prior art

RR-debugger which is much slower, but features more complete capturing, lets you replay but doesn't let you do any other analysis.
Sciunits which is much slower, more likely to crash, has less complete capturing, lets you replay but doesn't let you do other analysis.
Reprozip which is much slower and has less complete capturing.
CARE which is much slower, has less complete capturing, and lets you do containerized replay but not unprivileged native replay and not other analysis.
FSAtrace which is more likely to crash, has less complete capturing, and doesn't have replay or other analyses.
