PipelineDP is a project for performing differentially private (DP) aggregations in Python data pipelines.
The project is in the early development stage. More description will be added later.
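As a rough illustration of what a differentially private aggregation means, here is a minimal sketch of a noisy count using the Laplace mechanism. It is not PipelineDP's API: the names laplace_noise and dp_count are hypothetical, and the code assumes each user contributes at most one record, so the sensitivity is 1. Sampling floating-point noise this way is for illustration only and is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    # The difference of two i.i.d. Exp(rate=1/scale) variables is Laplace(0, scale).
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(records, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of records.

    Assumption: each user contributes at most one record, so adding or
    removing one user changes the true count by at most 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    return len(records) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    ratings = [5, 3, 4, 4, 1]
    print(dp_count(ratings, epsilon=1.0, rng=rng))
```

A smaller epsilon gives a larger noise scale, which means stronger privacy and a less accurate count.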
To install the requirements for local development, run make dev.
Please run make precommit to auto-format, lint check, and run tests.
Individual targets are format, lint, test, clean, dev.
Code style: Google Python Style Guide, https://google.github.io/styleguide/pyguide.html
This project depends on numpy, apache-beam, pyspark, absl-py, and dataclasses.
To install with poetry, run:
- git clone https://github.com/OpenMined/PipelineDP.git
- cd PipelineDP/
- poetry install
To install with pip, run:
- pip install numpy apache-beam pyspark absl-py
- (for Python 3.6) pip install dataclasses
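The dataclasses package on PyPI is a backport for Python 3.6; the module has been part of the standard library since Python 3.7. A small check such as the following sketch can tell whether the extra install step applies. The helper name needs_dataclasses_backport is hypothetical and not part of PipelineDP.

```python
import sys


def needs_dataclasses_backport(version_info=sys.version_info) -> bool:
    """Return True if the interpreter predates the stdlib dataclasses module (added in 3.7)."""
    # Compare only (major, minor); the micro version does not matter here.
    return tuple(version_info[:2]) < (3, 7)


if __name__ == "__main__":
    if needs_dataclasses_backport():
        print("Python < 3.7 detected: run 'pip install dataclasses'.")
    else:
        print("dataclasses is already in the standard library.")
```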
For development, it is convenient to run an end-to-end example. To do so:
1. Download the Netflix prize dataset from https://www.kaggle.com/netflix-inc/netflix-prize-data and unpack it.
2. The dataset is large, so to speed up the run it is better to use only part of it. You can generate a subset by running in bash:
   head -10000 combined_data_1.txt > data.txt
   or by any other method that takes a subset of lines from the dataset.
3. Run python movie_view_ratings.py --input_file=<path to data.txt from 2> --output_file=<...>
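Where head is not available (for example on Windows), the subset can also be produced in Python. The following is a minimal sketch, not a script shipped with the project: the function name take_first_lines is hypothetical, and the file names are taken from the bash command above.

```python
import itertools


def take_first_lines(src_path: str, dst_path: str, num_lines: int) -> int:
    """Copy the first num_lines lines of src_path to dst_path.

    Returns the number of lines actually written, which is smaller than
    num_lines when the source file is shorter.
    """
    written = 0
    # Binary mode copies bytes unchanged, like head, and avoids encoding
    # or newline translation issues.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in itertools.islice(src, num_lines):
            dst.write(line)
            written += 1
    return written
```

Calling take_first_lines("combined_data_1.txt", "data.txt", 10000) mirrors the head command above. The file is read line by line, so the full dataset is never loaded into memory.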