The code in this repository corresponds to the methodology and results as presented in the following two papers:
- "Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing"
- "Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data"
To replicate our conda environment, it should suffice to run the following sets of commands:
- Create and activate the env:
conda create --name mia_synthetic python=3.9
conda activate mia_synthetic
- Clone and install the requirements from the reprosyn repository:
git clone https://github.com/alan-turing-institute/reprosyn
cd reprosyn
curl -sSL https://install.python-poetry.org | python3 -
poetry install -E ektelo
To install poetry on your system, we refer to the poetry installation instructions.
- Note that, in order to get it to work for continuous attributes as well, you might need to look into this raised issue.
Then install the C-based optimized QBS, which was taken from the following repository (if you use it, please cite the paper):
cd src/optimized_qbs/
python setup.py install
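As a quick sanity check after the steps above, the following minimal sketch reports which packages cannot be found in the active environment. The import names `reprosyn` and `optimized_qbs` are assumptions based on the repository and folder names; adjust them to whatever the installs actually provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names (not confirmed by this README); edit as needed.
    expected = ["reprosyn", "optimized_qbs"]
    missing = missing_modules(expected)
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All expected modules were found.")
```

Run it with the mia_synthetic environment activated; an empty "missing" list only shows that the modules are discoverable, not that they work end to end.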
- In order to replicate the computation of the vulnerable record identification method, we refer to the notebook notebooks/Identify_vulnerable_records.ipynb.
- In order to replicate the attacks for specific target records, we refer to the Python file Achilles_main.py and the script scripts/run_experiment_achilles.sh.
- For all details concerning our novel target-attention attack, we refer to the code in src/set_based_classifier.
(3) Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data
In order to replicate the results of the MIAs using only synthetic data, we refer to the Python file synthetic_only_main.py and the script scripts/run_experiment_synthetic_only.sh.
The code contains the functionality to run the experiments under multiple scenarios. For details, we refer to the paper.
Throughout our contributions, we have considered two tabular datasets, referred to as UK Census and Adult.
For ease of use, we provide a very small sample of the original data in the data folder.
We also include the corresponding metadata files that are required as input for the reprosyn pipeline to generate synthetic data.
For full access to the respective datasets, we refer to the references in our papers.
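To see what the data folder provides before running an experiment, a small helper along these lines can group its files by extension. This is only a sketch: the folder layout and file types are not specified in this README, so the helper makes no assumption about file names and simply lists whatever it finds.

```python
from pathlib import Path


def list_dataset_files(data_dir="data"):
    """Group all files under `data_dir` by lowercase file extension.

    Returns a dict mapping an extension (e.g. ".csv") to a sorted list of paths.
    Returns an empty dict if the folder does not exist.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            files.setdefault(path.suffix.lower(), []).append(path)
    return files


if __name__ == "__main__":
    # "data" is the folder named in this README; run from the repository root.
    for extension, paths in list_dataset_files("data").items():
        print(extension or "(no extension)", [str(p) for p in paths])
```

The output makes it easier to match each data sample with its corresponding metadata file when preparing the inputs for the reprosyn pipeline.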