Leakage Analysis

A static analysis tool to detect test data leakage in Python notebooks

This is the tool accompanying the ASE'22 paper "Data Leakage in Notebooks: Static Detection and Better Processes". An online demo is also available. For our evaluation scripts and materials, please refer to this repo.

How to build

Install souffle, the datalog engine we use for our main analysis. Make sure that souffle can be invoked directly from the command line.
Pull and build our customized version of pyright, the type inference engine we use: git submodule update --init --recursive (please refer to the submodule for instructions on building the project).
Install the required Python packages listed in requirements.txt. We use Python 3.8 for our tool; other Python versions might produce different parsed ASTs and unexpected errors.

How to use

Run analysis for a single Python file: python3 -m src.main /path/to/file
Run analysis for all Python files in a directory: python3 -m src.run /path/to/dir
More information can be found using the -h flag.
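
As a rough illustration of the two commands above, the following sketch drives the single-file entry point from Python, one process per file. The helpers build_command and analyze_directory are hypothetical names introduced here and are not part of the tool; src.run remains the supported way to analyze a directory. The -o flag is the HTML output flag described later in this README.

```python
import subprocess
from pathlib import Path


def build_command(py_file, html_output=True):
    """Return the argv for analyzing one file (hypothetical helper)."""
    cmd = ["python3", "-m", "src.main", str(py_file)]
    if html_output:
        cmd.append("-o")  # request the HTML report
    return cmd


def analyze_directory(directory, runner=subprocess.run):
    """Analyze every .py file directly under `directory`.

    `runner` is injectable so the loop can be exercised without
    actually launching the analysis.
    """
    results = {}
    for py_file in sorted(Path(directory).glob("*.py")):
        completed = runner(build_command(py_file))
        results[py_file.name] = completed.returncode
    return results
```

Calling analyze_directory("/path/to/dir") from the repository root would then return a mapping from file name to exit code.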

How to build and run Docker image

Pull our customized version of pyright, the type inference engine we use: git submodule update --init --recursive.
Add all used Python libraries to requirements.txt, which will be installed in the container and used by pyright.
Build Docker image: docker build -t leakage-analysis .
Run Docker image: docker run -v /path/to/dir:/path/to/dir leakage-analysis /path/to/dir/$FILE -o. All notebooks to be analyzed must first be converted to Python files and stored in /path/to/dir.
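
The notebook-to-Python conversion mentioned in the last step is usually done with jupyter nbconvert --to script. For cases where Jupyter is not installed, here is a minimal standard-library sketch of the same idea. The function names are hypothetical, and the handling of IPython magics (commenting out lines that start with % or !) is an assumption made here, not necessarily what the tool expects.

```python
import json
from pathlib import Path


def notebook_to_source(notebook):
    """Concatenate the code cells of a parsed .ipynb (nbformat 4) into one script."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a list of lines or a single string
            source = "".join(source)
        # Comment out IPython magics and shell escapes, which are not valid Python.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in source.splitlines()
        ]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def convert_file(ipynb_path):
    """Write notebook.py next to notebook.ipynb and return the new path."""
    ipynb_path = Path(ipynb_path)
    notebook = json.loads(ipynb_path.read_text(encoding="utf-8"))
    out_path = ipynb_path.with_suffix(".py")
    out_path.write_text(notebook_to_source(notebook), encoding="utf-8")
    return out_path
```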

How to run Docker compose

Make sure Docker is installed.
Copy .env.example file to a new .env file.
Set the two variables in .env: HOST_NOTEBOOK_DIR, the path of the notebook directory to mount inside the container, and FILE, the filename of the Python file to be scanned.

How to read output

For a given input file test.py, an output HTML file test.html is generated if the -o flag is specified.

In test.html, we show the analysis results alongside the input code. A summary table of detected leakage issues is shown at the top. Users can also use the interactive buttons to highlight relevant code and navigate between code segments.

Internal Structure

Given a Python file, src/main.py first parses the input into an AST. It then feeds the AST to a GlobalCollector instance (from global_collector.py), which collects the global variables that cannot be renamed by later transformations; these variables are ignored in the later steps.
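
To make the collection step concrete, here is a minimal sketch of an AST visitor that records names declared with the global statement. GlobalNameCollector is a hypothetical stand-in written for this README; the real GlobalCollector may use different criteria for which variables count as non-renamable.

```python
import ast


class GlobalNameCollector(ast.NodeVisitor):
    """Hypothetical stand-in: records every name declared `global`."""

    def __init__(self):
        self.globals = set()

    def visit_Global(self, node):
        # ast.Global stores the declared names as a list of strings.
        self.globals.update(node.names)


def collect_globals(source):
    """Parse `source` and return the set of names declared global."""
    collector = GlobalNameCollector()
    collector.visit(ast.parse(source))
    return collector.globals
```

A later renaming pass could consult the returned set and leave those names unchanged.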

Next, it feeds the AST to a CodeTransformer instance (from irgen.py), which translates the original Python code into a simpler version by 1) breaking complex statements down into multiple simpler ones, and 2) converting the code to static single assignment (SSA) form.
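
The SSA idea can be illustrated with a deliberately small sketch: every plain assignment creates a new numbered version of its target, and every later use refers to the most recent version. StraightLineSSA is a hypothetical class written for this README, not the tool's CodeTransformer. It assumes straight-line code with plain assignments only; branches, loops, augmented assignments, and function scopes are out of scope. It also relies on ast.unparse, which requires Python 3.9 or newer, so it is meant to be run on its own rather than inside the tool's Python 3.8 environment.

```python
import ast


class StraightLineSSA(ast.NodeTransformer):
    """Rename variables so that each one is assigned exactly once."""

    def __init__(self):
        self.version = {}  # original name -> latest version number

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            # A definition: allocate the next version of this variable.
            original = node.id
            self.version[original] = self.version.get(original, -1) + 1
            node.id = f"{original}_{self.version[original]}"
        elif isinstance(node.ctx, ast.Load) and node.id in self.version:
            # A use: refer to the most recent version defined so far.
            node.id = f"{node.id}_{self.version[node.id]}"
        return node

    def visit_Assign(self, node):
        # Rename uses on the right-hand side before defining the new version.
        node.value = self.visit(node.value)
        node.targets = [self.visit(target) for target in node.targets]
        return node


def to_ssa(source):
    """Return `source` rewritten into SSA form (straight-line code only)."""
    tree = StraightLineSSA().visit(ast.parse(source))
    return ast.unparse(tree)
```

For example, to_ssa("x = 1\nx = x + 1\ny = x * 2") yields the three statements x_0 = 1, x_1 = x_0 + 1, and y_0 = x_1 * 2. Names that are never assigned in the file, such as imported modules, are left unchanged.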

Then it runs the type inference engine on the transformed code file. Using the inferred type information, it converts the code file into datalog facts that the final analysis can read, using FactGenerator from factgen.py.
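
As an illustration of what "datalog facts" means here: souffle reads each input relation from a tab-separated file named after the relation with a .facts extension. The sketch below extracts a simple data-flow relation from assignments and writes it in that format. The relation name Flow and its three columns are hypothetical choices for this example; the relations that FactGenerator actually emits are defined by the tool and are not reproduced here.

```python
import ast
from pathlib import Path


def assignment_facts(source):
    """Return (target, used_variable, line) tuples for simple assignments."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        # Every variable read on the right-hand side flows into the target.
        used = sorted({n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)})
        for target in node.targets:
            if isinstance(target, ast.Name):
                facts.extend((target.id, var, node.lineno) for var in used)
    return facts


def write_facts(facts, fact_dir, relation="Flow"):
    """Write one tab-separated row per fact to <fact_dir>/<relation>.facts."""
    path = Path(fact_dir) / f"{relation}.facts"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join("\t".join(map(str, row)) + "\n" for row in facts))
    return path
```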

Finally, it runs the datalog analysis (main.dl) on the generated facts and writes the results to the same directory.
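
For readers who want to run the last step by hand, the sketch below builds a souffle invocation using its -F (input fact directory) and -D (output directory) options. The helper names and the default directory name facts are assumptions for illustration; src/main.py decides the actual locations. The default of using one directory for both input and output mirrors the description above.

```python
import shutil
import subprocess


def souffle_command(program="src/main.dl", fact_dir="facts", output_dir=None):
    """Build the souffle argv; output defaults to the fact directory."""
    if output_dir is None:
        output_dir = fact_dir
    return ["souffle", program, "-F", str(fact_dir), "-D", str(output_dir)]


def run_souffle(**kwargs):
    """Run the analysis, failing early if souffle is not on PATH."""
    if shutil.which("souffle") is None:
        raise RuntimeError("souffle is not on PATH; see 'How to build'")
    return subprocess.run(souffle_command(**kwargs), check=True)
```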

Directory Structure

src
├── factgen.py: convert transformed code to datalog facts
├── global_collector.py: collect global variables
├── __init__.py
├── irgen.py: transform code to simpler SSA form
├── main.dl: main datalog analysis that analyzes leakage
├── main.py: run analysis on a single file
├── render.py: output an HTML file based on analysis results and original code
├── run.py: run analysis on multiple files
└── scope.py: manage variable scopes for renaming purposes
