CLI that implements the SugarCoat pipeline

Primary language: JavaScript. License: Mozilla Public License 2.0 (MPL-2.0).

SugarCoat Pipeline

SugarCoat is a tool that allows filterlist authors to automatically patch JavaScript scripts to restrict their access to sensitive data according to a custom privacy policy. Check out the blog post and paper!

This repo is an implementation of the SugarCoat pipeline. It uses pagegraph-crawl to crawl a given website and generate PageGraph graphs, pagegraph-rust-cli to get JavaScript script sources that match adblock rules from the generated graphs, and sugarcoat for the actual patching of JavaScript scripts.

You can specify which sensitive Web APIs to block in policy.json (example). By default, all SugarCoat pipeline output is written to output/ (change this with the -o argument). Patched scripts go in output/sugarcoated_scripts, and the generated EasyList-style filter rules go in output/sugarcoat_rules.txt.

Setup

  1. Git clone this repo:
git clone https://github.com/brave-experiments/sugarcoat-pipeline
cd sugarcoat-pipeline
  2. You need the Rust and Cargo toolchain set up to use the SugarCoat pipeline. The pagegraph-rust-cli Rust binary is built using Cargo as part of the post-installation phase.

  3. Install the NPM dependencies:

npm install

Note that the minimum Node version required is 14.18.1.

  4. You will also need a working PageGraph binary (an instrumented version of the Brave browser) to crawl the website you want to sugarcoat and generate the .graphml files that are then analyzed for scripts. You can build a binary following the wiki instructions, or download one for Intel Macs from the Release page here. Remember to unzip it! Alternatively, on the command line:

For Mac

# Download the latest Mac Intel zip (and follow redirect)
curl -L https://github.com/brave-experiments/sugarcoat-pipeline/releases/latest/download/pagegraph-mac-intel.zip -o pagegraph-mac-intel.zip
unzip pagegraph-mac-intel.zip
rm pagegraph-mac-intel.zip
  5. (Optional) Get a local copy of a filter list: you can get the latest copy of the EasyList filter list here, EasyPrivacy here, or uBlock Origin Unbreak here. Alternatively, there are copies in the repo.
curl -s https://easylist.to/easylist/easylist.txt -o easylist.txt
curl -s https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/unbreak.txt -o unbreak.txt
curl -s https://easylist.to/easylist/easyprivacy.txt -o easyprivacy.txt
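The minimum Node version noted in the install step (14.18.1) can be checked before running the pipeline. The following is a minimal sketch, not part of this repo: the helper name `meetsMinimum` is hypothetical, and it assumes plain `major.minor.patch` version strings without pre-release suffixes.

```javascript
// Minimum Node version stated in this README.
const MIN_NODE = '14.18.1';

// Compare two dotted versions numerically, component by component.
// Returns true when `actual` is greater than or equal to `minimum`.
// (Hypothetical helper; assumes "major.minor.patch" strings.)
function meetsMinimum(actual, minimum) {
  const a = actual.split('.').map(Number);
  const m = minimum.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const left = a[i] || 0;
    const right = m[i] || 0;
    if (left !== right) return left > right;
  }
  return true; // all components are equal
}

// process.versions.node has no leading "v", e.g. "18.17.0".
if (meetsMinimum(process.versions.node, MIN_NODE)) {
  console.log(`Node ${process.versions.node} satisfies the ${MIN_NODE} minimum.`);
} else {
  console.log(`Node ${process.versions.node} is older than the ${MIN_NODE} minimum.`);
}
```

The comparison is numeric rather than lexical, so 14.9.0 is correctly treated as older than 14.18.1.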

Usage

npm run sugarcoat-pipeline  -- -b <PATH_TO_PAGEGRAPH_BINARY> -u <URL> -t <SECS_TO_RUN_PAGEGRAPH> -l <FILTERLISTS>

Multiple filter lists can be passed as space-separated paths, e.g. -l easylist.txt unbreak.txt.

Example:

For Mac

npm run sugarcoat-pipeline -- -b pagegraph-mac-intel.app/Contents/MacOS/Brave\ Browser\ Development -t 10 -l easylist.txt unbreak.txt easyprivacy.txt -o output -u https://metacritic.com

Note that on macOS the binary must be the executable inside the .app bundle, not the .app itself.
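To run the pipeline against several sites, the command above can be driven from a small Node script. This is a sketch under assumptions: `buildPipelineArgs` and `runForUrls` are hypothetical helpers, not part of this repo, and the per-host output directory naming is an illustration. Only the `-b`, `-u`, `-t`, `-o`, and `-l` flags shown in this README are used.

```javascript
const { spawnSync } = require('child_process');

// Build the argument list that follows `npm run sugarcoat-pipeline --`.
// The property names on `opts` are hypothetical.
function buildPipelineArgs(opts) {
  const args = ['-b', opts.binary, '-u', opts.url];
  if (opts.secs !== undefined) args.push('-t', String(opts.secs));
  if (opts.output !== undefined) args.push('-o', opts.output);
  // -l accepts one or more values, so it is placed last here
  // to keep its values visually separate from the other flags.
  args.push('-l', ...opts.filterLists);
  return args;
}

// Run the pipeline once per URL, writing each crawl to its own
// output directory (assumed naming scheme: output-<hostname>).
function runForUrls(binary, urls, filterLists) {
  return urls.map((url) => {
    const output = `output-${new URL(url).hostname}`;
    const args = buildPipelineArgs({ binary, url, secs: 10, filterLists, output });
    const result = spawnSync(
      'npm',
      ['run', 'sugarcoat-pipeline', '--', ...args],
      { stdio: 'inherit' }
    );
    return { url, output, status: result.status };
  });
}
```

Because `spawnSync` is called without a shell, the spaces in the macOS binary path need no backslash escaping, e.g. `runForUrls('pagegraph-mac-intel.app/Contents/MacOS/Brave Browser Development', ['https://metacritic.com'], ['easylist.txt'])`.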

Now check output/ (the directory is created automatically).
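A quick way to inspect the results is to list the patched scripts and the generated rules. The sketch below assumes the default layout described earlier (output/sugarcoated_scripts and output/sugarcoat_rules.txt) and assumes the rules file holds one rule per line, with `!` marking comment lines as in EasyList. `summarizeOutput` is a hypothetical helper, not part of this repo.

```javascript
const fs = require('fs');
const path = require('path');

// Collect the patched script file names and the non-comment rule lines
// from a pipeline output directory. Missing paths yield empty lists.
// (Hypothetical helper; the one-rule-per-line format is an assumption.)
function summarizeOutput(outputDir = 'output') {
  const scriptsDir = path.join(outputDir, 'sugarcoated_scripts');
  const rulesFile = path.join(outputDir, 'sugarcoat_rules.txt');

  const scripts = fs.existsSync(scriptsDir)
    ? fs.readdirSync(scriptsDir).sort()
    : [];

  const rules = fs.existsSync(rulesFile)
    ? fs.readFileSync(rulesFile, 'utf8')
        .split(/\r?\n/)
        .map((line) => line.trim())
        // Drop blank lines and EasyList-style "!" comments.
        .filter((line) => line !== '' && !line.startsWith('!'))
    : [];

  return { scripts, rules };
}

const summary = summarizeOutput('output');
console.log(`${summary.scripts.length} patched script(s), ${summary.rules.length} rule(s)`);
```

If the pipeline was run with a different `-o` value, pass that directory to `summarizeOutput` instead.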

Help

$ npm run sugarcoat-pipeline  -- -h

> sugarcoat-pipeline@0.1.0 sugarcoat-pipeline
> node sugarcoat-pipeline.js "-h"

usage: sugarcoat-pipeline.js [-h] [-b BINARY] [-u URL] [-t SECS] [-d] -l FILTER_LISTS [FILTER_LISTS ...] [-p POLICY] [-o OUTPUT] [-g GRAPHS_DIR_OVERRIDE] [-k] [-r RETRIES] [-m] [-s]

SugarCoat pipeline CLI

optional arguments:
  -h, --help            show this help message and exit
  -b BINARY, --binary BINARY
                        Path to the PageGraph-enabled build of Brave
  -u URL, --url URL     The URL to record.
  -t SECS, --secs SECS  The dwell time in seconds. Default: 30 seconds
  -d, --debug           Print debugging information
  -l FILTER_LISTS [FILTER_LISTS ...], --filter-lists FILTER_LISTS [FILTER_LISTS ...]
                        Filter lists to use
  -p POLICY, --policy POLICY
                        Path to policy file. Default: policy.json
  -o OUTPUT, --output OUTPUT
                        Path to output directory. All generated files go here. Default: output
  -g GRAPHS_DIR_OVERRIDE, --graphs-dir-override GRAPHS_DIR_OVERRIDE
                        Path to graphs directory. If set, skips PageGraph generation
  -k, --keep            Do not erase intermediary files generated in output for sugarcoat
  -r RETRIES, --retries RETRIES
                        Number of times a URL is attempted to be re-crawled on failure. Default: 5
  -m, --no-minify       Do not minify generated SugarCoat script.
  -s, --keep-original-script-name
                        Keep original script name instead of setting it to be hash of contents.

Feedback

Something not working? Please raise an issue.

Testing

This project uses mocha for tests.

npm run test

To run in debug mode:

DEBUG=true npm run test

To run a specific test (mocha's -g option selects tests whose names match the given pattern):

npm run test -- -g "simple"
