Team GooseDP Solution to Differential Privacy Temporal Map Challenge (DeID2)-Sprint 3
We are team GooseDP from the University of Waterloo. We finished 5th in the NIST Temporal Map Challenge: Sprint 3. This repository contains our open-source solution to the challenge. The diagram below summarizes our approach; for full details, see the technical report in this repository (NIST_DP_Privacy_GooseDP_Writeup.pdf). For anyone who wants to generalize this approach to datasets in other domains, we provide suggested guidelines (Approach_to_Generalization.pdf).
├── Submission directory/
│   ├── Step0_Archetype_Generation/          *Step-0: Preprocessing
│   │   ├── Results_GMM/
│   │   └── k_archetypes.py
│   ├── Step1_Archetype_Counting/            *Step-1: Private Analysis
│   │   └── archetype_company_counts.py
│   ├── Step2_ Synthetic_Data_Generation/    *Step-2: Synthetic Record Generation
│   │   ├── sample_triplets.py
│   │   └── post_col_generation.py
│   ├── data/                                *Ground Truth Data and Parameters File
│   │   ├── parameters.json
│   │   ├── (public_data.csv)                *Public Dataset
│   │   └── (ground_truth.csv)               *Private Dataset
│   ├── main.py                              *Program Entrance
│   ├── requirements.txt                     *Package Requirements
│   ├── NIST_DP_Privacy_GooseDP_Writeup.pdf  *Technical Report
│   └── Approach_to_Generalization.pdf       *Generalization Guidance
To run our submission manually, first place the private dataset (ground_truth.csv) and the public dataset (public_data.csv) under the data/ directory, then install the required packages.
pip install -r requirements.txt
Then run the command to execute the main file.
python main.py
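Before running main.py, it can help to confirm that the input files are in place. The following is a minimal sketch of such a check, not part of the submission: the helper name missing_inputs is hypothetical, and the file names are taken from the directory layout above.

```python
from pathlib import Path

# File names follow the data/ directory layout described in this README.
REQUIRED_FILES = ("parameters.json", "public_data.csv", "ground_truth.csv")


def missing_inputs(data_dir="data"):
    """Return the required input files that are absent from data_dir.

    Hypothetical helper for illustration; it is not part of the submission.
    """
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All inputs found; run: python main.py")
```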
Main Function (main.py)
The entry point of our code submission.
We also provide a script, create_submission.sh, to zip the submission code files.
Step 0: Preprocessing (Step0_Archetype_Generation/)
The preprocessing step in the write-up corresponds to the contents of the Step0_Archetype_Generation/ directory.
In this directory, k_archetypes.py generates the archetypes, and the generated archetype information files are stored in the Results_GMM/ directory.
Note: This step uses only the public dataset, so we create the archetype files locally and include them in the submission.
Step 1: Private Analysis (Step1_Archetype_Counting/)
The private analysis step in the write-up corresponds to the contents of the Step1_Archetype_Counting/ directory.
In this directory, archetype_company_counts.py builds private histograms over the private dataset (see the write-up for details) and returns privatized counts of taxis and companies.
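For readers unfamiliar with private histograms, the sketch below shows the generic Laplace mechanism applied to bin counts. It is an illustration under stated assumptions, not the code in archetype_company_counts.py: the function names, the epsilon argument, and the default sensitivity of 1.0 are all hypothetical. In particular, the sensitivity must reflect how much one individual can change the histogram, and the write-up describes how the actual solution handles this.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(records, bins, epsilon, sensitivity=1.0, seed=None):
    """Count records per bin and add Laplace noise of scale sensitivity / epsilon.

    `bins` must be fixed in advance (not derived from the private data),
    so that empty bins receive noise as well.
    The default sensitivity of 1.0 is a placeholder assumption.
    """
    rng = random.Random(seed)
    counts = Counter(records)
    scale = sensitivity / epsilon
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in bins}
```

The noisy counts can be negative or fractional; any rounding or clipping applied afterwards is post-processing and does not consume additional privacy budget.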
Step 2: Synthetic Data Generation (Step2_sample_triplets/)
Synthesize Taxi-trips Record (sample_triplets.py)
The synthetic record generation step in the write-up corresponds to the contents of the Step2_sample_triplets/ directory.
In this directory, sample_triplets.py generates synthetic records for the ('taxi_id', 'shift', 'company_id', 'pickup_community_area', 'dropoff_community_area') columns.
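One common way to turn privatized counts into synthetic records is to sample keys with probability proportional to their (clipped) noisy counts. The sketch below illustrates that general idea only; it is not the logic of sample_triplets.py, and the function name and the example key layout are hypothetical.

```python
import random


def sample_from_noisy_counts(noisy_counts, n_records, seed=None):
    """Sample n_records keys with probability proportional to clipped noisy counts.

    Hypothetical helper for illustration; see the write-up for the actual procedure.
    """
    keys = list(noisy_counts)
    # Clipping negative counts to zero is post-processing of already-private values.
    weights = [max(0.0, noisy_counts[k]) for k in keys]
    if sum(weights) <= 0:
        raise ValueError("all noisy counts are non-positive; nothing to sample from")
    rng = random.Random(seed)
    return rng.choices(keys, weights=weights, k=n_records)
```

For example, with keys that are illustrative (shift, pickup area, dropoff area) tuples, calling sample_from_noisy_counts({(1, 8, 32): 5.2, (2, 8, 8): -1.3}, 10) returns ten copies of (1, 8, 32), because the negative count is clipped to zero.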
Synthesize Other Columns (post_col_generation.py)
The post-processing step in the write-up corresponds to the contents of the Step3_nonprivate_gen/ directory.
In this directory, post_col_generation.py generates synthetic values for the remaining columns, i.e., ('fare', 'trip_miles', 'trip_seconds', 'tips', 'trip_total', 'payment_type'), based on the k-marginals.
@misc{GooseDP_Syn,
  author = {Covington, Christian and Knopf, Karl and Mohapatra, Shubhankar and Zhang, Shufan},
  title = {TaxiTrip-Synthesizer: Team GooseDP Solution to Differential Privacy Temporal Map Challenge (DeID2)-Sprint 3},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ctcovington/goosedp_sprint3_open_source}}
}
Christian Covington
Karl Knopf
Shubhankar Mohapatra
Shufan Zhang