PipelineDP

PipelineDP is a project for performing Differentially Private (DP) aggregations in Python data pipelines.

The project is in an early stage of development. More documentation will be added later.

Development

To install the requirements for local development, run make dev.

Please run make precommit to auto-format, lint check, and run tests. Individual targets are format, lint, test, clean, dev.

This project depends on: numpy, apache-beam, pyspark, absl-py, dataclasses.

For installing with poetry please run:

For installing with pip please run:

For development, it is convenient to run an end-to-end example. To do this:

1. Download the Netflix prize dataset from https://www.kaggle.com/netflix-inc/netflix-prize-data and unpack it.
2. The full dataset is large, so use a subset of it to speed up the run. You can generate one by running in bash:

   head -10000 combined_data_1.txt > data.txt

   or by any other way of taking a subset of lines from the dataset.
3. Run python movie_view_ratings.py --input_file=<path to data.txt from step 2> --output_file=<...>
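
If bash is not available, the subsetting step above can also be done in Python. The following is a minimal sketch, not part of PipelineDP: the helper name write_head and its parameters are hypothetical, and the file names are simply the ones used in the head command.

```python
from itertools import islice


def write_head(src_path, dst_path, num_lines=10000):
    """Copy the first `num_lines` lines of src_path to dst_path.

    Hypothetical helper mirroring `head -N src > dst`; returns the number
    of lines actually written (fewer if the source file is shorter).
    """
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after num_lines, so the rest of the file is never read.
        for line in islice(src, num_lines):
            dst.write(line)
            written += 1
    return written


# Example call (assumes combined_data_1.txt is in the current directory):
# write_head("combined_data_1.txt", "data.txt", num_lines=10000)
```

The file is read lazily, line by line, so the large source file is never loaded into memory in full.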