prov-tracer

Evaluating system-level provenance tools for practical use

  • Paper directory
  • Computational provenance := how did this file get produced? What binaries, data, libraries, and other files were used? And what is the computational provenance of those files?
  • System-level provenance collects this data without knowing anything about the underlying programs (treating them as black boxes), just by looking at syscalls or the like.
  • This paper is a lit review of provenance systems

Provenance research presentation to GNU/Linux User’s Group

Provenance presentation to UBC

  • Presentation on Google Drive

IN-PROGRESS Measuring provenance overheads

  • Paper directory
  • Take provenance systems and benchmarks from the lit review, apply all prov systems to all benchmarks
  • Reproducing: See REPRODUCING.md
  • Code directory
    • prov_collectors.py contains “provenance collectors”
    • workloads.py contains the “workloads”. Each workload has a “setup” and a “run” phase (see the sketch after this list). For example, “setup” may download sources (we don’t want to time the setup; that would just benchmark the internet service provider), whereas “run” does the compile (we want to time only that).
    • runner.py will select certain collectors and workloads; if it succeeds, the results get stored in .cache/, so subsequent executions with the same arguments will return instantly
    • experiment.py contains the logic to run experiments (especially cleaning up after them)
    • run_exec_wrapper.py knows how to execute commands in a “clean” environment and cgroup
    • Stats-larger.ipynb contains the procedure for extracting statistics from the workload runs using Bayesian inference
    • flake.nix contains the Nix expressions which describe the environment in which everything runs
    • result/ directory contains the result of building flake.nix; all binaries and executables should come from result/ in order for the experiment to be reproducible
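
To make the setup/run split above concrete, here is a minimal sketch of what a workload looks like (names, URLs, and commands are illustrative; the real interface lives in workloads.py):

    import subprocess
    import urllib.request
    from pathlib import Path

    class ExampleCompileWorkload:
        """Illustrative workload: fetch sources in setup() (untimed), compile in run() (timed)."""

        def __init__(self, work_dir: Path) -> None:
            self.work_dir = work_dir
            self.tarball = work_dir / "src.tar.gz"

        def setup(self) -> None:
            # Untimed: network I/O here would only benchmark the internet service provider.
            urllib.request.urlretrieve("https://example.com/src.tar.gz", str(self.tarball))
            subprocess.run(["tar", "-xzf", str(self.tarball), "-C", str(self.work_dir)], check=True)

        def run(self) -> None:
            # Timed: only the compile, using binaries from ./result/ for reproducibility.
            subprocess.run(["./result/bin/make", "-C", str(self.work_dir / "src")], check=True)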

Rapid review

Redo rapid review with snowballing

Include record/replay terms

  • Add sciunit
  • Add reprozip
  • Add DetTrace
  • Add CDE
  • Add Burrito
  • Add Sumatra

Get workloads to work

Get Apache to compile

  • We need to get src_sh{./result/bin/python runner.py apache} to work

Cannot find pcre-config

  • I invoke src_sh{./configure --with-pcre-config=/path/to/pcre-config}, and ./configure still complains (“no pcre-config found”).
  • I ended up patching with httpd-configure.patch.

lber.h not found

  • /nix/store/2z0hshv096hhavariih722pckw5v150v-apr-util-1.6.3-dev/include/apr_ldap.h:79:10: fatal error: lber.h: No such file or directory

Get Spack workloads to compile

  • We need to get src_sh{./result/bin/python runner.py spack} to work
  • See docstring of SpackInstall in workloads.py.
  • Spack installs a target package (call it $spec) and all of $spec’s dependencies. Then it removes $spec, while leaving the dependencies.
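
In other words, the untimed setup pre-builds the dependencies, so that (presumably) the timed run only has to rebuild $spec itself. A hedged sketch of that split (paths and flags are assumptions; the authoritative logic is SpackInstall in workloads.py):

    import subprocess

    SPACK = "./result/bin/spack"  # assumption: spack is exposed via result/

    def setup(spec: str) -> None:
        # Untimed: build $spec plus all of its dependencies, then remove only $spec,
        # leaving the dependencies installed.
        subprocess.run([SPACK, "install", spec], check=True)
        subprocess.run([SPACK, "uninstall", "--yes-to-all", spec], check=True)

    def run(spec: str) -> None:
        # Timed: reinstall just $spec against the already-built dependencies.
        subprocess.run([SPACK, "install", spec], check=True)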

Write a Workload class for Apache + ApacheBench

  • Compiling Apache is an interesting benchmark, but running Apache with a predefined request load is also an interesting benchmark.
  • We should write a new class called ApacheLoad that installs Apache in its setup() (for simplicity, we won’t reuse the version we built earlier), downloads ApacheBench, and in its run() drives the server with the request load, using only tools from result/ or .work/.
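
A rough sketch of the proposed class (paths, port, and ab arguments are placeholders, not the final implementation):

    import subprocess
    import time
    from pathlib import Path

    class ApacheLoad:
        def __init__(self, work_dir: Path) -> None:
            self.server_root = work_dir / "httpd"
            self.httpd = self.server_root / "bin" / "httpd"  # built in setup()
            self.ab = self.server_root / "bin" / "ab"        # ApacheBench ships with httpd

        def setup(self) -> None:
            # Untimed: download and build a fresh Apache (and ab) into .work/,
            # using only toolchain binaries from ./result/.
            ...

        def run(self) -> None:
            # Timed: start the server in the foreground, drive a predefined request
            # load against it, then shut it down.
            server = subprocess.Popen([str(self.httpd), "-X", "-d", str(self.server_root)])
            time.sleep(1)  # crude; a real implementation should poll the port
            try:
                subprocess.run([str(self.ab), "-n", "10000", "-c", "10",
                                "http://localhost:8080/"], check=True)
            finally:
                server.terminate()
                server.wait()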

BACKLOG THTTPD and cherokee

http://www.acme.com/software/thttpd/ https://github.com/larryhe/tinyhttpd https://github.com/mendsley/tinyhttp https://cherokee-project.com/

BACKLOG SPEC CPU 2006

SSH

https://github.com/LineRate/ssh-perf

Shellbench

https://github.com/shellspec/shellbench

CleanML

https://chu-data-lab.github.io/CleanML/

Create Postmark workload

Create lmbench benchmark

Create filebench benchmark

Native snakemake and nf-core workflows

Make browser benchmarks

Create mercurial/VCS workload

VIC

FIE

Write a ProFTPD benchmark

IN-PROGRESS Write a CompileLinux class

  • Write a class that compiles the Linux kernel (just the kernel, no user-space software), using only tools from result/.
  • The benchmark should use a specific pin of the Linux kernel and set kernel build options. Both should be customizable and set by files that are checked into Git. However, the Linux source tree should not be checked into Git (see build Apache, where I download the source code in setup() and cache it for future use).
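
A sketch of the shape this class could take (the pin file, config file, and download URL are placeholders; only the setup/run split and the “download once, cache, never commit” rule come from the notes above):

    import subprocess
    import urllib.request
    from pathlib import Path

    KERNEL_VERSION_FILE = Path("config/linux-version.txt")  # checked into Git
    KERNEL_CONFIG_FILE = Path("config/linux-config")        # checked into Git

    class CompileLinux:
        def __init__(self, work_dir: Path) -> None:
            self.work_dir = work_dir
            self.version = KERNEL_VERSION_FILE.read_text().strip()
            self.src_dir = work_dir / f"linux-{self.version}"

        def setup(self) -> None:
            # Untimed: download the pinned source once and cache the tarball; the
            # source tree itself is never committed to Git.
            tarball = self.work_dir / f"linux-{self.version}.tar.xz"
            if not tarball.exists():
                urllib.request.urlretrieve(
                    f"https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-{self.version}.tar.xz",
                    str(tarball))
            subprocess.run(["tar", "-xf", str(tarball), "-C", str(self.work_dir)], check=True)
            (self.src_dir / ".config").write_bytes(KERNEL_CONFIG_FILE.read_bytes())

        def run(self) -> None:
            # Timed: build only the kernel image, with tools from ./result/.
            subprocess.run(["./result/bin/make", "-C", str(self.src_dir), "-j8", "vmlinux"],
                           check=True)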

Refactor BLAST workloads

  • It should be easy to run a large, consistent set of many different BLAST apps.
  • Maybe have 1 min, 10 min, and 60 min configurations that are randomly selected but then fixed
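
One way to get “randomly selected but fixed” configurations is to seed the RNG with the target duration (app list and sizes below are placeholders):

    import random

    BLAST_APPS = ["blastn", "blastp", "blastx", "tblastn", "tblastx"]

    def fixed_config(target_minutes: int) -> list[str]:
        # A fixed seed makes the "random" selection identical on every run.
        rng = random.Random(f"blast-{target_minutes}")
        n_apps = {1: 1, 10: 3, 60: 5}[target_minutes]  # rough scaling, to be calibrated
        return rng.sample(BLAST_APPS, n_apps)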

Investigate Sysbench

Investigate BT-IO

https://www.nas.nasa.gov/software/npb.html

Run xSDK codes

Spark workload

https://www.databricks.com/blog/2017/10/05/build-complex-data-pipelines-with-unified-analytics-platform.html

yt workloads

https://yt-project.org/doc/cookbook/index.html https://prappleizer.github.io/#tutorials https://trident.readthedocs.io/en/latest/annotated_example.html https://github.com/PyLCARS/YT_BeyondAstro

Make API easier to use

Refactor runner.py

  • Change to run_store_analyze.py
  • runner.py mixes code for selecting benchmarks and prov collectors with code for summarizing statistical outputs.
  • Use --benchmarks and --collectors to form a grid
  • Accept --iterations, --seed, --fail-first
  • Accept --analysis $foo
  • Should have an option to import external workloads and prov_collectors
  • Should have --re-run, which removes .cache/results_* and .cache/$hash
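
A sketch of the resulting command-line surface (the flag names are from the list above; the wiring is illustrative):

    import argparse

    def parse_args() -> argparse.Namespace:
        parser = argparse.ArgumentParser(prog="run_store_analyze.py")
        parser.add_argument("--benchmarks", nargs="+", default=["all"])
        parser.add_argument("--collectors", nargs="+", default=["all"])
        parser.add_argument("--iterations", type=int, default=1)
        parser.add_argument("--seed", type=int, default=0)
        parser.add_argument("--fail-first", action="store_true")
        parser.add_argument("--analysis", action="append", default=[])
        parser.add_argument("--import", dest="imports", action="append", default=[],
                            help="module that provides extra workloads/prov_collectors")
        parser.add_argument("--re-run", action="store_true",
                            help="remove .cache/results_* and .cache/$hash before running")
        return parser.parse_args()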

Refactor run_exec_wrapper.py

  • Should fail gracefully when cgroups are not available
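
A minimal sketch of such a check (cgroup-v2 path; the real fallback belongs in run_exec_wrapper.py):

    import warnings
    from pathlib import Path

    def cgroups_available() -> bool:
        # cgroup v2 exposes a unified hierarchy under /sys/fs/cgroup.
        return Path("/sys/fs/cgroup/cgroup.controllers").exists()

    def maybe_setup_cgroup(name: str) -> None:
        if not cgroups_available():
            warnings.warn("cgroups unavailable; running without resource isolation")
            return
        # ... create the cgroup, write our PID into cgroup.procs, set limits, etc.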

Refactor stats.py

  • Should have analysis hooks of type Callable[[pandas.DataFrame], None]
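
That is, each analysis is a function that takes the results DataFrame and produces its output as a side effect; a sketch (column names are assumptions):

    from typing import Callable
    import pandas as pd

    Analysis = Callable[[pd.DataFrame], None]

    def print_summary(df: pd.DataFrame) -> None:
        # df is assumed to have one row per (collector, workload, iteration).
        print(df.groupby(["collector", "workload"])["walltime"].describe())

    analyses: list[Analysis] = [print_summary]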

Refactor prov_collectors.py

  • Should have teardown

Refactor workloads.py

  • Should accept a tempdir
  • Should be smaller
  • Should have teardown
  • Should export instances
  • Categories: build Spack, build Deb, BLAST, compile Linux, HTTP bench, FTP bench
  • Should have license()
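
A sketch of the interface implied by this list (only setup/run/teardown/license and the category tags come from the notes; everything else is an assumption):

    import abc
    from pathlib import Path

    class Workload(abc.ABC):
        category: str  # e.g. "build Spack", "build Deb", "BLAST", "compile Linux", ...

        @abc.abstractmethod
        def setup(self, tempdir: Path) -> None: ...

        @abc.abstractmethod
        def run(self, tempdir: Path) -> None: ...

        def teardown(self, tempdir: Path) -> None:
            pass  # default: nothing to clean up

        def license(self) -> str:
            raise NotImplementedError

    # Export instances, not just classes:
    WORKLOADS: list[Workload] = []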

Write run.py

  • Just runs one workload
  • --setup, --main, --teardown
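
A sketch of that single-workload runner (flag names from the list above; the lookup by class name is an assumption):

    import argparse
    from pathlib import Path
    from workloads import WORKLOADS  # assumes workloads.py exports instances (see above)

    def main() -> None:
        parser = argparse.ArgumentParser(prog="run.py")
        parser.add_argument("workload")
        parser.add_argument("--setup", action="store_true")
        parser.add_argument("--main", action="store_true")
        parser.add_argument("--teardown", action="store_true")
        args = parser.parse_args()
        workload = next(w for w in WORKLOADS if type(w).__name__ == args.workload)
        tempdir = Path(".work") / args.workload
        if args.setup:
            workload.setup(tempdir)
        if args.main:
            workload.run(tempdir)
        if args.teardown:
            workload.teardown(tempdir)

    if __name__ == "__main__":
        main()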

Document user interface

Make easier to install

Allow classes to specify Nix packages

  • But the user should be able to customize the lockfile
  • setup() should run nix build and add the result to PATH
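
A sketch of how a class could declare its Nix packages and have setup() build them and extend PATH (the flake references are placeholders; `nix build --no-link --print-out-paths` assumes a flakes-enabled Nix):

    import os
    import subprocess

    class NixPackagesMixin:
        nix_packages: list[str] = []  # e.g. ["nixpkgs#apacheHttpd"], pinned by the lockfile

        def setup_nix(self) -> None:
            for ref in self.nix_packages:
                out_path = subprocess.run(
                    ["nix", "build", ref, "--no-link", "--print-out-paths"],
                    check=True, capture_output=True, text=True,
                ).stdout.strip()
                os.environ["PATH"] = f"{out_path}/bin:" + os.environ["PATH"]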

Package Python code for PyPI using Poetry

Document installation

Provenance collectors

Fix Sciunits

  • We need to get src_sh{./result/bin/python runner.py sciunit} to work.
  • Sciunit is a Python package which depends on a binary called ptu.
  • Sciunit says “sciunit: /nix/store/7x6rlzd7dqmsa474j8ilc306wlmjb8bp-python3-3.10.13-env/lib/python3.10/site-packages/sciunit2/libexec/ptu: No such file or directory”, but on my system, that file does exist! Why can’t sciunits find it?
  • Answer: That file exists; it is an ELF binary whose “interpreter” is set to /lib64/linux-something.so. That interpreter does not exist. I replaced this copy of ptu with the nix-built copy of ptu.

Fix sciunit

Fix strace unparsable lines

Fix rr to measure storage overhead

Fix Spade+FUSE

  • We need to get src_sh{./result/bin/python runner.py spade_fuse} to work.

Get SPADE Neo4J database to work

  • src_sh{./result/bin/spade start && echo "add storage Neo4J $PWD/db" | ./result/bin/spade control}
  • Currently, that fails with “Adding storage Neo4J… error: Unable to find/load class”
  • The log can be found in ~/.local/share/SPADE/current.log.
  • ~/.local/share/SPADE/lib/neo4j-community/lib/*.jar contains the Neo4J classes, and I believe these are on the classpath. However, SPADE appears to be running under a different Java version (or something like that), which refuses to load those jars.

[#C] Write BPF trace

  • We need to write a basic prov collector for BPF trace. The collector should log files read/written by the process and all children processes. Start by writing prov.bt.
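
A very rough, assumption-laden starting point: trace openat() for one PID via bpftrace driven from Python. Following child processes and separating reads from writes are left for prov.bt proper.

    import subprocess

    # Logs every openat() made by the given PID; $1 is bpftrace's first positional parameter.
    OPENAT_PROBE = r"""
    tracepoint:syscalls:sys_enter_openat /pid == $1/ {
        printf("%d openat %s\n", pid, str(args->filename));
    }
    """

    def start_trace(pid: int, log_path: str) -> subprocess.Popen:
        # Assumes bpftrace is provided by result/ and that we have the needed privileges.
        return subprocess.Popen(
            ["./result/bin/bpftrace", "-o", log_path, "-e", OPENAT_PROBE, str(pid)])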

Package CARE

https://proot-me.github.io/care/

Write a sleepy ptracer

Package/write-up PTU

Discuss VAMSA

Build CentOS packages

Stats

Measure arithmetic intensity for each workload

  • IO calls / CPU sec, where CPU sec is itself a random variable

Measure slowdown as a function of arithmetic intensity

[#C] Count dynamic instructions in entire program

  • IO calls / 1M dynamic instructions

Plot IO vs CPU sec

Plot confidence interval of slowdown per arithmetic intensity

Evaluate prediction based on arithmetic intensity

  • slowdown(prov_collector) * cpu_to_wall_time(workload) * runtime(workload) ~ runtime(workload, prov_collector)
  • What is the expected percent error?
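
A sketch of that evaluation (column names in the results DataFrame are assumptions):

    import pandas as pd

    def percent_error(df: pd.DataFrame) -> pd.Series:
        # Predicted runtime under a collector, from the collector's slowdown and
        # workload-only measurements, compared against the measured runtime.
        predicted = df["slowdown"] * df["cpu_to_wall_time"] * df["runtime_no_prov"]
        actual = df["runtime_with_prov"]
        return (predicted - actual).abs() / actual * 100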

Characterize benchmarks and benchmark classes by syscall breakdown

Revise Bayesian model to use benchmark class

  • How many classes and benchmarks does one need?

Writing

Write introduction

Write background

Write literature rapid review section

Write benchmark and prov collector collection

Revise introduction (60)

  • Smoosh Motivation and Background together
  • Lead with the problem
  • 1 problem -> provenance (vs perf overhead) -> 3 other problems solved -> 3 ways to gather

Explain how strace, ltrace, fsatrace, rr got to be there

Explain how Sciunits, ReproZip got to be there

Describe experimental results

Explain the capabilities/features of each prov tracer

  • Table of capabilities (vDSO)

Discussion

  • What provenance methods are most promising?
  • Threats to validity
  • Mathematical model
  • Few of the tools are applicable to comp sci due to methods
  • How many work for distributed systems
  • How to handle network

Story-telling

  • Gaps in prior work re comp sci
  • Stakeholder perspectives:
    • Tool developers, users, facilities people
  • Long-term archiving of an execution, such that it is re-executable
  • I/O definition? I/O includes things like the username and clock_gettime

Conclusion

Threats to validity

Background

Page-limit

Reproducibility appendix

BACKLOG Record/replay reproducibility with library interposition

  • Paper directory
  • Record/replay is an easier way to get reproducibility than Docker/Nix/etc.
  • Use library interposition to make a record/replay tool that is faster than other record/replay tools

BACKLOG Get global state vars

Vars