prov-tracer

Evaluating system-level provenance tools for practical use

  • Paper directory
  • Computational provenance := how did this file get produced? What binaries, data, libraries, and other files were used? And what is the computational provenance of those files?
  • System-level provenance collects this data without knowing anything about the underlying programs (treating them as black boxes), just by looking at syscalls or the like.
  • This paper is a lit review of provenance systems

Provenance research presentation to GNU/Linux User’s Group

Provenance presentation to UBC

  • Presentation on Google Drive

IN-PROGRESS Measuring provenance overheads

  • Paper directory
  • Take provenance systems and benchmarks from the lit review, apply all prov systems to all benchmarks
  • Reproducing: See REPRODUCING.md
  • Code directory
    • prov_collectors.py contains “provenance collectors”
    • workloads.py contains the “workloads”. Each workload has a “setup” and a “run” phase (see the sketch after this list). For example, “setup” may download sources (we don’t want to time the setup; that would just benchmark the internet service provider), whereas “run” does the compile (we want to time only that).
    • runner.py will select certain collectors and workloads; if it succeeds, the results get stored in .cache/, so subsequent executions with the same arguments will return instantly
    • experiment.py contains the logic to run experiments (especially cleaning up after them)
    • run_exec_wrapper.py knows how to execute commands in a “clean” environment and cgroup
    • Stats-larger.ipynb contains the procedure for extracting statistics from the workload runs using Bayesian inference
    • flake.nix contains the Nix expressions which describe the environment in which everything runs
    • result/ directory contains the result of building flake.nix; all binaries and executables should come from result/ in order for the experiment to be reproducible
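
To make the setup/run split above concrete, here is a minimal sketch of what a workload looks like (names, URLs, and commands are illustrative; the real interface lives in workloads.py):

    import subprocess
    import urllib.request
    from pathlib import Path

    class ExampleCompileWorkload:
        """Illustrative workload: fetch sources in setup() (untimed), compile in run() (timed)."""

        def __init__(self, work_dir: Path) -> None:
            self.work_dir = work_dir
            self.tarball = work_dir / "src.tar.gz"

        def setup(self) -> None:
            # Untimed: network I/O here would only benchmark the internet service provider.
            urllib.request.urlretrieve("https://example.com/src.tar.gz", str(self.tarball))
            subprocess.run(["tar", "-xzf", str(self.tarball), "-C", str(self.work_dir)], check=True)

        def run(self) -> None:
            # Timed: only the compile, using binaries from ./result/ for reproducibility.
            subprocess.run(["./result/bin/make", "-C", str(self.work_dir / "src")], check=True)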

Rapid review

Redo rapid review with snowballing

Include record/replay terms

  • Add sciunit
  • Add reprozip
  • Add DetTrace
  • Add CDE
  • Add Burrito
  • Add Sumatra

Get workloads to work

Get Apache to compile

  • We need to get src_sh{./result/bin/python runner.py apache} to work

Cannot find pcre-config

  • I invoke src_sh{./configure --with-pcre-config=/path/to/pcre-config}, and ./configure still complains (“no pcre-config found”).
  • I ended up patching with httpd-configure.patch.

lber.h not found

  • /nix/store/2z0hshv096hhavariih722pckw5v150v-apr-util-1.6.3-dev/include/apr_ldap.h:79:10: fatal error: lber.h: No such file or directory

Get Spack workloads to compile

  • We need to get src_sh{./result/bin/python runner.py spack} to work
  • See docstring of SpackInstall in workloads.py.
  • Spack installs a target package (call it $spec) and all of $spec’s dependencies. Then it removes $spec, while leaving the dependencies.
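
In other words, the untimed setup pre-builds the dependencies, so that (presumably) the timed run only has to rebuild $spec itself. A hedged sketch of that split (paths and flags are assumptions; the authoritative logic is SpackInstall in workloads.py):

    import subprocess

    SPACK = "./result/bin/spack"  # assumption: spack is exposed via result/

    def setup(spec: str) -> None:
        # Untimed: build $spec plus all of its dependencies, then remove only $spec,
        # leaving the dependencies installed.
        subprocess.run([SPACK, "install", spec], check=True)
        subprocess.run([SPACK, "uninstall", "--yes-to-all", spec], check=True)

    def run(spec: str) -> None:
        # Timed: reinstall just $spec against the already-built dependencies.
        subprocess.run([SPACK, "install", spec], check=True)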

Write a Workload class for Apache + ApacheBench

  • Compiling Apache is an interesting benchmark, but running Apache with a predefined request load is also an interesting benchmark.
  • We should write a new class called ApacheLoad that installs Apache in its setup() (for simplicity, we won’t reuse the version we built earlier), downloads ApacheBench, and in its run() drives the server with the request load, using only tools from result/ or .work/.
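
A rough sketch of the proposed class (paths, port, and ab arguments are placeholders, not the final implementation):

    import subprocess
    import time
    from pathlib import Path

    class ApacheLoad:
        def __init__(self, work_dir: Path) -> None:
            self.server_root = work_dir / "httpd"
            self.httpd = self.server_root / "bin" / "httpd"  # built in setup()
            self.ab = self.server_root / "bin" / "ab"        # ApacheBench ships with httpd

        def setup(self) -> None:
            # Untimed: download and build a fresh Apache (and ab) into .work/,
            # using only toolchain binaries from ./result/.
            ...

        def run(self) -> None:
            # Timed: start the server in the foreground, drive a predefined request
            # load against it, then shut it down.
            server = subprocess.Popen([str(self.httpd), "-X", "-d", str(self.server_root)])
            time.sleep(1)  # crude; a real implementation should poll the port
            try:
                subprocess.run([str(self.ab), "-n", "10000", "-c", "10",
                                "http://localhost:8080/"], check=True)
            finally:
                server.terminate()
                server.wait()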

BACKLOG THTTPD and cherokee

http://www.acme.com/software/thttpd/ https://github.com/larryhe/tinyhttpd https://github.com/mendsley/tinyhttp https://cherokee-project.com/

BACKLOG SPEC CPU 2006

SSH

https://github.com/LineRate/ssh-perf

Shellbench

https://github.com/shellspec/shellbench

CleanML

https://chu-data-lab.github.io/CleanML/

Create Postmark workload

Create lmbench benchmark

Create filebench benchmark

Native snakemake and nf-core workflows

Make browser benchmarks

Create mercurial/VCS workload

VIC

FIE

Write a ProFTPD benchmark

IN-PROGRESS Write a CompileLinux class

  • Write a class that compiles the Linux kernel (just the kernel, no user-space software), using only tools from result/.
  • The benchmark should use a specific pin of the Linux kernel and set kernel build options. Both should be customizable and set by files that are checked into Git. However, the Linux source tree should not be checked into Git (see build Apache, where I download the source code in setup() and cache it for future use).
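
A sketch of the shape this class could take (the pin file, config file, and download URL are placeholders; only the setup/run split and the “download once, cache, never commit” rule come from the notes above):

    import subprocess
    import urllib.request
    from pathlib import Path

    KERNEL_VERSION_FILE = Path("config/linux-version.txt")  # checked into Git
    KERNEL_CONFIG_FILE = Path("config/linux-config")        # checked into Git

    class CompileLinux:
        def __init__(self, work_dir: Path) -> None:
            self.work_dir = work_dir
            self.version = KERNEL_VERSION_FILE.read_text().strip()
            self.src_dir = work_dir / f"linux-{self.version}"

        def setup(self) -> None:
            # Untimed: download the pinned source once and cache the tarball; the
            # source tree itself is never committed to Git.
            tarball = self.work_dir / f"linux-{self.version}.tar.xz"
            if not tarball.exists():
                urllib.request.urlretrieve(
                    f"https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-{self.version}.tar.xz",
                    str(tarball))
            subprocess.run(["tar", "-xf", str(tarball), "-C", str(self.work_dir)], check=True)
            (self.src_dir / ".config").write_bytes(KERNEL_CONFIG_FILE.read_bytes())

        def run(self) -> None:
            # Timed: build only the kernel image, with tools from ./result/.
            subprocess.run(["./result/bin/make", "-C", str(self.src_dir), "-j8", "vmlinux"],
                           check=True)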

Refactor BLAST workloads

  • It should be easy to run a large, consistent set of many different BLAST apps.
  • Maybe have 1 min, 10 min, and 60 min configurations that are randomly selected but then fixed
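
One way to get “randomly selected but fixed” configurations is to seed the RNG with the target duration (app list and sizes below are placeholders):

    import random

    BLAST_APPS = ["blastn", "blastp", "blastx", "tblastn", "tblastx"]

    def fixed_config(target_minutes: int) -> list[str]:
        # A fixed seed makes the "random" selection identical on every run.
        rng = random.Random(f"blast-{target_minutes}")
        n_apps = {1: 1, 10: 3, 60: 5}[target_minutes]  # rough scaling, to be calibrated
        return rng.sample(BLAST_APPS, n_apps)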

Investigate Sysbench

Investigate BT-IO

https://www.nas.nasa.gov/software/npb.html

Run xSDK codes

Spark workload

https://www.databricks.com/blog/2017/10/05/build-complex-data-pipelines-with-unified-analytics-platform.html

yt workloads

https://yt-project.org/doc/cookbook/index.html https://prappleizer.github.io/#tutorials https://trident.readthedocs.io/en/latest/annotated_example.html https://github.com/PyLCARS/YT_BeyondAstro

Make API easier to use

Refactor runner.py

  • Change to run_store_analyze.py
  • runner.py mixes code for selecting benchmarks and prov collectors with code for summarizing statistical outputs.
  • Use --benchmarks and --collectors to form a grid
  • Accept --iterations, --seed, --fail-first
  • Accept --analysis $foo
  • Should have an option to import external workloads and prov_collectors
  • Should have --re-run, which removes .cache/results_* and .cache/$hash
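
A sketch of the resulting command-line surface (the flag names are from the list above; the wiring is illustrative):

    import argparse

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(prog="run_store_analyze.py")
        parser.add_argument("--benchmarks", nargs="+", default=["all"])
        parser.add_argument("--collectors", nargs="+", default=["all"])
        parser.add_argument("--iterations", type=int, default=1)
        parser.add_argument("--seed", type=int, default=0)
        parser.add_argument("--fail-first", action="store_true")
        parser.add_argument("--analysis", action="append", default=[])
        parser.add_argument("--import", dest="imports", action="append", default=[],
                            help="module that provides extra workloads/prov_collectors")
        parser.add_argument("--re-run", action="store_true",
                            help="remove .cache/results_* and .cache/$hash before running")
        return parser.parse_args()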

Refactor run_exec_wrapper.py

  • Should fail gracefully when cgroups are not available
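
A minimal sketch of such a check (cgroup-v2 path; the real fallback belongs in run_exec_wrapper.py):

    import warnings
    from pathlib import Path

    def cgroups_available() -> bool:
        # cgroup v2 exposes a unified hierarchy under /sys/fs/cgroup.
        return Path("/sys/fs/cgroup/cgroup.controllers").exists()

    def maybe_setup_cgroup(name: str) -> None:
        if not cgroups_available():
            warnings.warn("cgroups unavailable; running without resource isolation")
            return
        # ... create the cgroup, write our PID into cgroup.procs, set limits, etc.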

Refactor stats.py

  • Should have analysis hooks of type Callable[[pandas.DataFrame], None]
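
That is, each analysis is a function that takes the results DataFrame and produces its output as a side effect; a sketch (column names are assumptions):

    from typing import Callable
    import pandas as pd

    Analysis = Callable[[pd.DataFrame], None]

    def print_summary(df: pd.DataFrame) -> None:
        # df is assumed to have one row per (collector, workload, iteration).
        print(df.groupby(["collector", "workload"])["walltime"].describe())

    analyses: list[Analysis] = [print_summary]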

Refactor prov_collectors.py

  • Should have teardown

Refactor workloads.py

  • Should accept a tempdir
  • Should be smaller
  • Should have teardown
  • Should export instances
  • Categories: build Spack, build Deb, BLAST, compile Linux, HTTP bench, FTP bench
  • Should have license()
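
A sketch of the interface implied by this list (only setup/run/teardown/license and the category tags come from the notes; everything else is an assumption):

    import abc
    from pathlib import Path

    class Workload(abc.ABC):
        category: str  # e.g. "build Spack", "build Deb", "BLAST", "compile Linux", ...

        @abc.abstractmethod
        def setup(self, tempdir: Path) -> None: ...

        @abc.abstractmethod
        def run(self, tempdir: Path) -> None: ...

        def teardown(self, tempdir: Path) -> None:
            pass  # default: nothing to clean up

        def license(self) -> str:
            raise NotImplementedError

    # Export instances, not just classes:
    WORKLOADS: list[Workload] = []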

Write run.py

  • Just runs one workload
  • --setup, --main, --teardown
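
A sketch of that single-workload runner (flag names from the list above; the lookup by class name is an assumption):

    import argparse
    from pathlib import Path
    from workloads import WORKLOADS  # assumes workloads.py exports instances (see above)

    def main() -> None:
        parser = argparse.ArgumentParser(prog="run.py")
        parser.add_argument("workload")
        parser.add_argument("--setup", action="store_true")
        parser.add_argument("--main", action="store_true")
        parser.add_argument("--teardown", action="store_true")
        args = parser.parse_args()
        workload = next(w for w in WORKLOADS if type(w).__name__ == args.workload)
        tempdir = Path(".work") / args.workload
        if args.setup:
            workload.setup(tempdir)
        if args.main:
            workload.run(tempdir)
        if args.teardown:
            workload.teardown(tempdir)

    if __name__ == "__main__":
        main()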

Document user interface

Make easier to install

Allow classes to specify Nix packages

  • But the user should be able to customize the lockfile
  • setup() should run nix build and add the result to PATH
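
A sketch of how a class could declare its Nix packages and have setup() build them and extend PATH (the flake references are placeholders; `nix build --no-link --print-out-paths` assumes a flakes-enabled Nix):

    import os
    import subprocess

    class NixPackagesMixin:
        nix_packages: list[str] = []  # e.g. ["nixpkgs#apacheHttpd"], pinned by the lockfile

        def setup_nix(self) -> None:
            for ref in self.nix_packages:
                out_path = subprocess.run(
                    ["nix", "build", ref, "--no-link", "--print-out-paths"],
                    check=True, capture_output=True, text=True,
                ).stdout.strip()
                os.environ["PATH"] = f"{out_path}/bin:" + os.environ["PATH"]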

Package Python code for PyPI using Poetry

Document installation

Provenance collectors

Fix Sciunits

  • We need to get src_sh{./result/bin/python runner.py sciunit} to work.
  • Sciunit is a Python package which depends on a binary called ptu.
  • Sciunit says “sciunit: /nix/store/7x6rlzd7dqmsa474j8ilc306wlmjb8bp-python3-3.10.13-env/lib/python3.10/site-packages/sciunit2/libexec/ptu: No such file or directory”, but on my system, that file does exist! Why can’t sciunits find it?
  • Answer: That file exists; it is an ELF binary whose “interpreter” is set to /lib64/linux-something.so. That interpreter does not exist. I replaced this copy of ptu with the nix-built copy of ptu.

Fix sciunit

Fix strace unparsable lines

Fix rr to measure storage overhead

Fix Spade+FUSE

  • We need to get src_sh{./result/bin/python runner.py spade_fuse} to work.

Get SPADE Neo4J database to work

  • src_sh{./result/bin/spade start && echo "add storage Neo4J $PWD/db" | ./result/bin/spade control}
  • Currently, that fails with “Adding storage Neo4J… error: Unable to find/load class”
  • The log can be found in ~/.local/share/SPADE/current.log.
  • ~/.local/share/SPADE/lib/neo4j-community/lib/*.jar contains the Neo4J classes, and I believe these are on the classpath. However, SPADE appears to be running under a different Java version (or something like that), which refuses to load those jars.

[#C] Write BPF trace

  • We need to write a basic prov collector for BPF trace. The collector should log files read/written by the process and all children processes. Start by writing prov.bt.
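
A very rough, assumption-laden starting point: trace openat() for one PID via bpftrace driven from Python. Following child processes and separating reads from writes are left for prov.bt proper.

    import subprocess

    # Logs every openat() made by the given PID; $1 is bpftrace's first positional parameter.
    OPENAT_PROBE = r"""
    tracepoint:syscalls:sys_enter_openat /pid == $1/ {
        printf("%d openat %s\n", pid, str(args->filename));
    }
    """

    def start_trace(pid: int, log_path: str) -> subprocess.Popen:
        # Assumes bpftrace is provided by result/ and that we have the needed privileges.
        return subprocess.Popen(
            ["./result/bin/bpftrace", "-o", log_path, "-e", OPENAT_PROBE, str(pid)])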

Package CARE

https://proot-me.github.io/care/

Write a sleepy ptracer

Package/write-up PTU

Discuss VAMSA

Build CentOS packages

Stats

Measure arithmetic intensity for each workload

  • IO calls / CPU sec, where CPU sec is itself a random variable

Measure slowdown as a function of arithmetic intensity

[#C] Count dynamic instructions in entire program

  • IO calls / 1M dynamic instructions

Plot IO vs CPU sec

Plot confidence interval of slowdown per arithmetic intensity

Evaluate prediction based on arithmetic intensity

  • slowdown(prov_collector) * cpu_to_wall_time(workload) * runtime(workload) ~ runtime(workload, prov_collector)
  • What is the expected percent error?
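
A sketch of that evaluation (column names in the results DataFrame are assumptions):

    import pandas as pd

    def percent_error(df: pd.DataFrame) -> pd.Series:
        # Predicted runtime under a collector, from the collector's slowdown and
        # workload-only measurements, compared against the measured runtime.
        predicted = df["slowdown"] * df["cpu_to_wall_time"] * df["runtime_no_prov"]
        actual = df["runtime_with_prov"]
        return (predicted - actual).abs() / actual * 100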

Characterize benchmarks and benchmark classes by syscall breakdown

Revise Bayesian model to use benchmark class

  • How many classes and benchmarks does one need?

Writing

Write introduction

Write background

Write literature rapid review section

Write benchmark and prov collector collection

Revise introduction (60)

  • Smoosh Motivation and Background together
  • Lead with the problem
  • 1 problem -> provenance (vs perf overhead) -> 3 other problems solved -> 3 ways to gather

Explain how strace, ltrace, fsatrace, rr got to be there

Explain how Sciunits, ReproZip got to be there

Describe experimental results

Explain the capabilities/features of each prov tracer

  • Table of capabilities (vDSO)

Discussion

  • What provenance methods are most promising?
  • Threats to validity
  • Mathematical model
  • Few of the tools are applicable to comp sci due to methods
  • How many work for distributed systems
  • How to handle network

Story-telling

  • Gaps in prior work re comp sci
  • Stakeholder perspectives:
    • Tool developers, users, facilities people
  • Long-term archiving of an execution, such that it is re-executable
  • I/O definition? I/O includes things like the username and clock_gettime

Conclusion

Threats to validity

Background

Page-limit

Reproducibility appendix

BACKLOG Record/replay reproducibility with library interposition

  • Paper directory
  • Record/replay is an easier way to get reproducibility than Docker/Nix/etc.
  • Use library interposition to make a record/replay tool that is faster than other record/replay tools

BACKLOG Get global state vars

Vars