
Reads-From Fuzzer (RFF)

Reads-From Fuzzer (RFF) is a tool for concurrency testing. See our paper from ASPLOS 2024 for details!

If you use our work for academic research, please cite our paper:

@inproceedings{10.1145/3620665.3640389,
author = {Wolff, Dylan and Shi, Zheng and Duck, Gregory J. and Mathur, Umang and Roychoudhury, Abhik},
title = {Greybox Fuzzing for Concurrency Testing},
year = {2024},
isbn = {9798400703850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3620665.3640389},
doi = {10.1145/3620665.3640389},
booktitle = {Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2},
pages = {482–498},
numpages = {17},
location = {La Jolla, CA, USA},
series = {ASPLOS '24}
}

Please note that all scripts should be run from the base directory of the repository, as they may contain relative paths.

Dependencies

Core dependencies

Docker and Python3, e.g. on Ubuntu:

sudo snap install docker
sudo groupadd docker
sudo usermod -aG docker $USER

sudo apt install -y python3

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

sudo ./afl-complain.sh

The AFL system configuration steps may not work on WSL. If this is the case, you can set AFL_SKIP_CPUFREQ=1 in all Docker containers (e.g. using the -e option with docker run).
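For example, a minimal sketch of forwarding that variable into a container (the sum-fuzzer image name is taken from the setup section below; the out directory is an assumption matching the later examples):

```shell
# Skip AFL's CPU-frequency-governor check inside the container (useful on WSL)
docker run -t -e AFL_SKIP_CPUFREQ=1 -v $(pwd)/out:/opt/out sum-fuzzer
```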

Recommended dependencies

GNU parallel

sudo apt install -y parallel

A recent Ubuntu version is recommended for a host system to ensure compatibility with various commands and convenience scripts to run experiments.

Outside of Docker

The RFF tool itself is completely set up in Docker. To run outside of Docker, see the dockerfiles/Dockerfile.base for the general system setup and dependencies.

Setup

Use Docker to build the benchmarks and real-world programs for schedfuzz. Running ./build_ck.sh builds all PERIOD benchmarks (SCTBench and ConVul) for RFF (this may take up to 30 minutes).

To build and run a fuzzing container individually:

docker build -f dockerfiles/Dockerfile.base -t schedfuzz-base .
docker build -f dockerfiles/Dockerfile.sum -t sum-fuzzer .
mkdir out
sudo ./afl-complain.sh
docker run -t -v $(pwd)/out:/opt/out sum-fuzzer

There is also a run_one.sh convenience script to run a container with a volume mounted in and forward some environment variables for the PERIOD benchmarks.

To re-run schedules from a fuzzing run, or to do other more interactive exploration, pop a shell in the container (if the container is already running, use docker exec with the container name instead of docker run with the image name):

docker run -it -v $(pwd)/out:/opt/out sum-fuzzer /bin/bash

Another convenience script ./run_dev.sh will start a container with a mounted volume and pop an interactive shell in it.

Once you are inside the container, you can manually instrument and fuzz programs.

  1. To instrument a program, use ./instrument.sh <path to compiled prog> in the container.
  2. To fuzz the instrumented program manually set any necessary environment variables and use:
./fuzz.sh -i afl-in -o afl-out -d -- <instrumented program> <program arguments>

See the various dockerfiles in ./scalability for examples

To run the scheduler without instrumentation:

LD_PRELOAD=$(pwd)/libsched.so <path to compiled prog.> <program arguments>

By passing the SCHEDULE=<path to schedule> environment variable, you can replay a schedule generated by RFF.
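Combining the two, a minimal replay sketch (the schedule path and program name are illustrative; actual paths depend on your fuzzing run's output layout):

```shell
# Replay a previously recorded schedule deterministically
# (the schedule file path here is hypothetical)
SCHEDULE=/opt/out/afl-out/queue/id:000042 \
LD_PRELOAD=$(pwd)/libsched.so ./myprog arg1 arg2
```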

Organization

sched-fuzz -- contains the latest version of the RFF tool

  • sched.c -- the binary instrumentation hook
  • libsched.cpp -- the dynamic library that wraps pthread to do schedule serialization and control; does most of the heavy lifting
  • AFL-2.57b -- our modified version of AFL which drives the "schedule fuzzing" (fuzzing loop and mutations)

alternative-tools -- the PCT and QLearning implementations are on different branches of the main repo, but are included in this directory for convenience of the artifact evaluation

SCTBench and ConVul (PERIOD Benchmarks)

Before running the PERIOD benchmarks, make sure to build the docker container, set up your system for AFL, and disable ASLR:

./build_ck.sh
sudo ./afl-complain.sh

For these benchmarks, you can use the environment variables FUZZERS and TARGET_KEY to select the fuzzers and target programs within that benchmark. FUZZERS should be a comma-separated list (no spaces). These environment variables can be used in conjunction with the run_one.sh script, e.g.:

FUZZERS=pos-only-schedfuzz,schedfuzz TARGET_KEY=CS/reorder_5 ./run_one.sh period

Other parameters, such as AFL_TIMEOUT and TIME_BUDGET, can be set directly in the container using the -e option for docker run (when not using the run_one.sh script).
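For instance, a sketch of setting those parameters directly (the numeric values are illustrative, not recommended defaults):

```shell
# Hypothetical values: per-execution timeout and overall time budget
docker run -t -e AFL_TIMEOUT=5000 -e TIME_BUDGET=300 \
  -v $(pwd)/out:/opt/out sum-fuzzer
```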

Experiments from the Paper

E1

Before conducting the experiments, build the images for RFF, PCT, and QLearning by running ./build_all.sh. Also make sure to clean up any stray volumes created by prior runs or experimentation. To start the experiment, run the ./e1.sh script from the base directory of the project. By default this runs at the full parallelism of your system, but the script can be edited to restrict the number of cores used. Note that this script does require sudo to change ownership of the output files, because the Docker user is set to root.

The experiment generates many Docker volumes containing result.json files with the run statistics. These are then aggregated by scripts/period/parse-agg.py into a new full_data.csv file on your host system with the raw experimental results. The scripts/data-analysis/analyze.py script then post-processes this data, printing a LaTeX table to stdout and outputting a PNG file to assets/cum-scheds-to-bug.png. The total runtime is roughly (54 programs x 5 minutes x 20 trials x 4 tools) / (number of CPUs).

E2

Again, make sure to clean up any stray volumes before running, and make sure the base RFF image is built (./build_ck.sh). Run the ./e2.sh script from the root directory of the project. This copies the raw reads-from pair and reads-from sequence hash data to the scripts/data-analysis/freq-data directory. Note that this script also needs sudo to change ownership of output files from Docker. It then runs scripts/data-analysis/bar-freq.py to process the data and generate plots in assets/bar.png. This experiment should take only a few minutes (<15).

Ablation Studies / Parameters

The behavior of the fuzzer can be changed by environment variables. There are also convenient "abbreviations" for the PERIOD benchmarks that can be appended to the fuzzer name in the FUZZERS environment variable (e.g. in run_one.sh). Some are listed below; see scripts/period/one.py for a more exhaustive list.

| abbreviation          | environment var + value | desc                                                                 |
|-----------------------|-------------------------|----------------------------------------------------------------------|
| no-afl-cov            | NO_AFL_COV=1            | No control flow feedback                                             |
| no-pos                | NO_POS=1                | No partial order sampling                                            |
| pos-only              | POS_ONLY=1              | Only partial order sampling                                          |
| depth-3               | MAX_DEPTH=3             | Only RF schedules of length 3 or less                                |
| always-rand           | ALWAYS_RAND=1           | Always change random seed on schedule mutation                       |
| max-sp                | SCORE_PATTERN=max       | Race predictor scoring                                               |
| avg-sp                | SCORE_PATTERN=avg       | Race predictor scoring                                               |
| thread-affinity-200   | THREAD_AFFINITY=200     | bias somewhat towards not switching threads                          |
| thread-affinity--200  | THREAD_AFFINITY=-200    | bias somewhat towards switching threads                              |
| thread-affinity-500   | THREAD_AFFINITY=500     | bias more towards not switching threads                              |
| thread-affinity--500  | THREAD_AFFINITY=-500    | bias more towards switching threads                                  |
| thread-affinity-800   | THREAD_AFFINITY=800     | bias heavily towards not switching threads                           |
| thread-affinity--800  | THREAD_AFFINITY=-800    | bias heavily towards switching threads                               |
| max-multi-mutations-2 | MAX_MULTI_MUTATIONS=2   | allow insertion, deletion etc. of two RF's at once                   |
| max-multi-mutations-3 | MAX_MULTI_MUTATIONS=3   | allow insertion, deletion etc. of three RF's at once                 |
| all-pairs             | ALL_PAIRS=1             | don't take pairs to flip in schedule from race predictor             |
| all-rff               | ALL_RFF=1               | get RF feedback from all observer RF's (not just those in schedule)  |
| power-coe             | POWER_COE=1             | give extra weight to rare RF's observed                              |
| stage-max-128         | SCHED_STAGE_MAX=128     | increase number of schedules explored in each "stage" of fuzzing     |
|                       | JSON_SCHEDULE=1         | use a human readable JSON file to record each schedule (slow!)       |

Some common configurations:

power-coe-always-rand-schedfuzz -- i.e. our approach

pos-only-schedfuzz -- partial order sampling
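Putting these pieces together, an illustrative invocation for the PERIOD benchmarks (the target key is just the example used earlier in this README):

```shell
# Compare the full RFF configuration against partial order sampling alone
FUZZERS=power-coe-always-rand-schedfuzz,pos-only-schedfuzz \
TARGET_KEY=CS/reorder_5 ./run_one.sh period
```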

Scalability

To run on load/store-heavy real-world programs, naive binary instrumentation of all memory operations can be very expensive. In most cases, only a very small subset of these operations access memory that is shared across multiple threads. To filter out the unnecessary instrumentation, there is a sched-fuzz/selective-instrument.sh script which takes a subset of instruction offsets in the binary to instrument. This subset can be obtained by analyzing a full trace of the program, which can take an extremely long time to gather for large, load/store-heavy programs (many hours). Note that this step is not optimized and was not used for any of the experiments in the paper, as the programs in the benchmarks are not as load/store heavy.

For the SQLite program in the scalability directory, built with python ./build_scale.py, this has already been done. Alternatively, you can skip the instrumentation step altogether, as in the x264 example in the scalability directory. RFF will still be able to serialize and test the program, just not at the granularity of individual memory operations (preemptions will only occur at pthread functions).
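A hedged sketch of that workflow (the offsets file name and the exact arguments accepted by selective-instrument.sh are assumptions for illustration, not a documented interface; check the script itself before use):

```shell
# 1. Trace the fully instrumented program and derive the offsets of
#    instructions that touch shared memory (shared-offsets.txt is a
#    hypothetical file containing one instruction offset per line).
# 2. Re-instrument only those offsets:
./sched-fuzz/selective-instrument.sh ./myprog shared-offsets.txt
```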

Limitations

The binary instrumentation is not guaranteed to succeed on every instruction; in practice, about 97-99% of loads and stores end up instrumented.

Right now, malloc/free and other dynamic calls are not used as scheduling points, but this could easily be remedied.

Data

See scripts/data-analysis/full-data.csv for E1 data from our tool/framework, and scripts/data-analysis/freq-data/*.csv for the corresponding E2 data. Additional data and analysis scripts for PERIOD etc. are all in scripts/data-analysis.

Errors to Ignore

Some harmless cleanup errors may appear during runs, e.g.:

rm: cannot remove 'SafeStack.afl': No such file or directory
rm: cannot remove 'afl-in': No such file or directory