# Analyzing the Leaky Cauldron
The goal of this project is to evaluate the privacy leakage of differentially private machine learning algorithms. The code is adapted from the code base of the membership inference attack work by Shokri et al.
Below we describe the setup and installation instructions. To run the experiments for the following projects, refer to their respective README files (hyperlinked):

- [Evaluating Differentially Private Machine Learning in Practice](evaluating_dpml/)
- [Revisiting Membership Inference Under Realistic Assumptions](improved_mi/)
- [Are Attribute Inference Attacks Just Imputation?](improved_ai/)
## Software Requirements
- Python 3.8
- TensorFlow: to use TensorFlow with GPU, `cuda-toolkit-11` and `cudnn-8` are also required.
- TensorFlow Privacy
## Installation Instructions
We assume the system runs Ubuntu 18.04. The easiest way to get Python 3.8 is to install Anaconda 3 and then install the dependencies via pip. The following bash code installs the dependencies (including `scikit_learn`, `tensorflow>=2.4.0` and `tf-privacy`) in a virtual environment:

```bash
$ python3 -m venv env
$ source env/bin/activate
$ python3 -m pip install --upgrade pip
$ python3 -m pip install --no-cache-dir -r requirements.txt
```
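The `requirements.txt` in the repository pins the full dependency set; if you ever need to recreate a minimal version by hand, one consistent with the packages named above might look like the following (the exact PyPI package names, notably `tensorflow-privacy`, are assumptions here):

```text
scikit-learn
tensorflow>=2.4.0
tensorflow-privacy
```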
Furthermore, to use CUDA-compatible NVIDIA GPUs, the following script (copied from the TensorFlow website) should be executed to install cuda-toolkit-11 and cudnn-8 as required by tensorflow-gpu:
```bash
# Add NVIDIA package repositories
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
$ sudo apt-get update
$ wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
$ sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
$ sudo apt-get update
$ wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
$ sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
$ sudo apt-get update

# Install development and runtime libraries (~4GB)
$ sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0 \
    libcudnn8-dev=8.0.4.30-1+cuda11.0

# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install TensorRT. Requires that libcudnn8 is installed above.
$ sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0
```
## Obtaining the Data Sets
Data sets can be obtained using the `preprocess_dataset.py` script provided in the `extra/` folder. The script requires raw files for the respective data sets, which can be found online at the following links:
- Purchase-100X: The source file `transactions.csv` can be downloaded from https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data and should be saved in the `dataset/` folder.
- Census19: The source files can be downloaded from https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/ and should be saved in the `dataset/census/` folder. Alternatively, the source files can be obtained by running the `crawl_census_data.py` script in the `extra/` folder:

  ```bash
  $ python3 crawl_census_data.py
  ```

- Texas-100X: The `PUDF_base1q2006_tab.txt`, `PUDF_base2q2006_tab.txt`, `PUDF_base3q2006_tab.txt` and `PUDF_base4q2006_tab.txt` files can be downloaded from https://www.dshs.texas.gov/THCIC/Hospitals/Download.shtm and should be saved in the `dataset/texas_100_v2/` folder.
Once the source files for the respective data set are obtained, the `preprocess_dataset.py` script can generate the processed data set files, which take the form of two pickle files: `$DATASET_feature.p` and `$DATASET_labels.p` (where `$DATASET` is a placeholder for the data set file name). For Purchase-100X, `$DATASET = purchase_100`. For Texas-100X, `$DATASET = texas_100_v2`. For Census19, `$DATASET = census`.

```bash
$ python3 preprocess_dataset.py $DATASET --preprocess=1
```
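After preprocessing, each data set is simply a pair of pickled arrays following the `$DATASET` naming convention. A minimal loading sketch (the tiny placeholder lists written here stand in for the real `preprocess_dataset.py` output, which lives in the `dataset/` folder; `purchase_100` is used as the example data set name):

```python
import pickle

# Stand-in for the real preprocess_dataset.py output: tiny placeholder
# feature/label lists pickled under the $DATASET naming convention.
with open('purchase_100_feature.p', 'wb') as f:
    pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)
with open('purchase_100_labels.p', 'wb') as f:
    pickle.dump([7, 42], f)

# Loading works the same way for the real files in the dataset/ folder.
with open('purchase_100_feature.p', 'rb') as f:
    features = pickle.load(f)
with open('purchase_100_labels.p', 'rb') as f:
    labels = pickle.load(f)

# Features and labels are aligned record-by-record.
assert len(features) == len(labels)
```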
Alternatively, the Census19 data set (as used in the attribute inference paper) can also be found in the `dataset/` folder in zip format.
For pre-processing other data sets, bound the L2 norm of each record to 1 and pickle the features and labels separately into `$DATASET_feature.p` and `$DATASET_labels.p` files in the `dataset/` folder (where `$DATASET` is a placeholder for the data set file name, e.g. for the Purchase-100 data set, `$DATASET` will be `purchase_100`).
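The steps above can be sketched as follows. This is a minimal illustration, not the repo's own preprocessing code: the toy features/labels and the `my_dataset` name are placeholders, and in the repository the pickle files would go under `dataset/`:

```python
import pickle
import numpy as np

def bound_l2(features):
    """Scale down any record whose L2 norm exceeds 1; leave the rest unchanged."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, 1.0)

# Hypothetical toy data set standing in for real features and labels.
features = bound_l2(np.array([[3.0, 4.0], [0.3, 0.4]]))
labels = np.array([0, 1])

# Every record now has L2 norm at most 1.
assert np.all(np.linalg.norm(features, axis=1) <= 1.0 + 1e-9)

# Pickle under the $DATASET naming convention ('my_dataset' is a placeholder).
with open('my_dataset_feature.p', 'wb') as f:
    pickle.dump(features, f)
with open('my_dataset_labels.p', 'wb') as f:
    pickle.dump(labels, f)
```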