DataSynthesizer generates a synthetic dataset from a sensitive one so that it can be released to the public. It is developed in Python 3.6 and requires some third-party modules, including numpy, scipy, pandas, and dateutil.
Its usage is presented in the following Jupyter Notebooks:
- DataSynthesizer Usage (random mode).ipynb
- DataSynthesizer Usage (independent attribute mode).ipynb
- DataSynthesizer Usage (correlated attribute mode).ipynb

Notes on the input dataset:
- The input dataset is assumed to be a table in first normal form (1NF).
- The active domain of an attribute is the set of values that actually appear for it in the table. When implementing differential privacy, DataSynthesizer injects noise into the statistics computed within the active domain.
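
To illustrate what injecting noise into statistics within the active domain can look like, here is a minimal sketch that adds Laplace noise to the value counts of one categorical attribute. It is not DataSynthesizer's own code: the function name `noisy_distribution`, the choice of the Laplace mechanism, and the sensitivity of 1 (adding or removing one row changes one count by 1) are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def noisy_distribution(column, epsilon, rng=None):
    """Hypothetical helper: Laplace-noised distribution over the active domain.

    column  -- pandas Series holding one categorical attribute
    epsilon -- privacy budget; smaller values mean more noise
    """
    rng = np.random.default_rng() if rng is None else rng

    # Active domain: only the values that actually occur in the column.
    counts = column.value_counts().sort_index()

    # Assumption: sensitivity of a count histogram is 1, so scale = 1 / epsilon.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(counts.to_numpy(dtype=float) + noise, 0.0, None)

    # Renormalise to a probability distribution; fall back to uniform
    # if every noisy count was clipped to zero.
    total = noisy.sum()
    if total == 0.0:
        probs = np.full(len(counts), 1.0 / len(counts))
    else:
        probs = noisy / total
    return pd.Series(probs, index=counts.index)


ages = pd.Series(["20-29", "30-39", "30-39", "40-49"])
dist = noisy_distribution(ages, epsilon=1.0, rng=np.random.default_rng(0))
```

Values outside the active domain (for example an age bucket that never occurs in `ages`) get no entry in `dist`, so they receive no probability mass in this sketch.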
There is a web-based UI in the webUI/ directory, which is a self-contained Django project. Here is a simple way to run it on your machine:
- Miniconda is recommended as the Python distribution. It includes the user-friendly package manager conda. Note that DataSynthesizer requires Python 3.
- After installing it on your machine, run `conda install numpy pandas scikit-learn matplotlib seaborn jupyter django` in a terminal to install the packages.
- Clone or download this repo to your local machine.
- Open a terminal and run `cd [repo directory]/webUI/`.
- Run `python manage.py runserver`. The web-based UI will be hosted at http://127.0.0.1:8000/synthesizer/.
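
If the server fails to start, a quick check of which required modules are importable may help. The sketch below is only an illustration: the helper name `missing_modules` is hypothetical, and the list of import names is assumed from the modules mentioned in this README (numpy, scipy, pandas, dateutil, plus django for the web UI).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages this README mentions.
REQUIRED = ["numpy", "scipy", "pandas", "dateutil", "django"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Any name reported as missing can then be installed with conda as described in the steps above.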