Application of the rmckenna mechanism to the adult dataset

This repository is based on the mechanism by Ryan McKenna, who won first place in the NIST Differential Privacy Synthetic Data Challenge in 2018. It modifies his submitted solution to run on the adult dataset.

Installation

Set up the Python 3 working environment.

# Create the virtual environment
python3 -m venv venv

# Activate the virtual environment
. venv/bin/activate

# Install the dependencies
pip install -r requirements.txt

Clone the private-pgm repository into another directory and add its src directory to the Python path. You can also add the export line to your .bashrc so that the Python path is set automatically in every new shell.

# Clone it in another directory
cd ..

# Clone the private-pgm repository
git clone https://github.com/ryan112358/private-pgm

# Add the src directory to the python path
export PYTHONPATH="$PYTHONPATH:$(pwd)/private-pgm/src"
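
To confirm that the Python path is set correctly, a quick check like the following may help. It is only a sketch: the package name mbi is an assumption about what private-pgm/src contains, and the helper is_importable is a hypothetical name introduced for this example.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# "mbi" is assumed to be the package shipped in private-pgm/src.
name = "mbi"
if is_importable(name):
    print(f"{name} found on the Python path")
else:
    print(f"{name} not found; check PYTHONPATH", file=sys.stderr)
```

If the module is reported as missing, check that the export command was run from the directory that contains the private-pgm clone.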

Dataset

The adult dataset can be downloaded from this link. Afterwards, format it using notebooks/adult-preprocess.ipynb and generate the required domain information file using notebooks/adult-domain.ipynb.
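
As a rough illustration of what a domain information file might contain, the sketch below counts the distinct values of each column of a CSV file. It assumes, without confirmation from the notebooks, that the domain file is a JSON object mapping each column name to its number of possible values and that the preprocessed dataset only holds discrete values. The function domain_from_csv and the sample data are made up for this example; the notebooks remain the reference way to produce the file.

```python
import csv
import io
import json


def domain_from_csv(csv_file) -> dict:
    """Map each column name to its number of distinct values."""
    reader = csv.DictReader(csv_file)
    distinct = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            distinct[name].add(value)
    return {name: len(values) for name, values in distinct.items()}


# Tiny made-up sample, not real adult data.
sample = io.StringIO("age,sex\n0,1\n1,0\n2,1\n")
domain = domain_from_csv(sample)
print(json.dumps(domain))  # prints {"age": 3, "sex": 2}
```

Counting distinct values underestimates the domain size when some admissible value never appears in the data, so treat the result as a starting point.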

Generate a synthetic dataset

Use the following command to generate a synthetic dataset. You can also configure the parameters (use --help to list them).

python adult.py  # --help displays the parameters

Execution on the GPU

Installation

Check that the driver of your graphics card is installed and that the card supports CUDA. Install CUDA from the NVIDIA website and reboot your computer after the installation.

Check your version of CUDA using /usr/local/cuda/bin/nvcc --version or nvcc --version. If your version of CUDA is 11, install the corresponding version of PyTorch using:

# If your version of CUDA is 11 (the cu111 wheel is built for CUDA 11.1)
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Check that torch has access to cuda:

python -c "import torch; print(torch.cuda.is_available())"
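
For a slightly more detailed report than the one-liner above, a small script along these lines could be used. It is a sketch rather than part of the repository; describe_cuda is a hypothetical helper name, and the script only relies on torch.cuda.is_available, torch.version.cuda and torch.cuda.get_device_name.

```python
def describe_cuda() -> str:
    """Summarize whether PyTorch is installed and can reach a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    return (f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
            f"device 0: {torch.cuda.get_device_name(0)}")


print(describe_cuda())
```

If CUDA is reported as unavailable, the installed torch wheel may not match the CUDA version found with nvcc.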

A small modification must be made to the sources of the private-pgm repository. Add the following lines before line 267, which sets the diff variable:

# With the torch backend, the Q linear operator (of type
# matrix.Identity) is not a Tensor, so the Tensor operations
# that follow raise an error. Convert it to a Tensor first.
if all((self.backend == 'torch',
        str(type(Q)) == "<class 'matrix.Identity'>")):
    import torch

    # First, we transform Q into a numpy array
    q_as_numpy_array = Q * np.identity(Q.shape[1])

    # Then, we format this numpy array to a Tensor
    Q = torch.as_tensor(q_as_numpy_array, dtype=torch.float32,
                        device=self.Factor.device)

    # Sanity check that the conversion keeps the values of Q.
    # It passed in every run that I launched; keep or remove
    # it as you prefer.
    assert np.array_equal(q_as_numpy_array, Q.cpu().numpy())

Usage

You can generate a synthetic dataset on the GPU by setting the backend parameter to torch.

python adult.py --backend torch  # use --help instead to display the parameters

You can monitor GPU usage with watch -d -n 0.5 nvidia-smi. You can also use nvtop (install it with sudo apt install -y nvtop, then run nvtop).
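
If you prefer to log GPU usage from Python, for instance alongside a long run, the sketch below parses the CSV output of nvidia-smi. The function names and dictionary keys are hypothetical and chosen for this example; the script assumes every queried field is numeric and does not handle fields that nvidia-smi reports as N/A.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(output: str) -> list:
    """Parse `nvidia-smi --format=csv,noheader,nounits` lines into dicts."""
    stats = []
    for line in output.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({"util_percent": int(util),
                      "mem_used_mib": int(used),
                      "mem_total_mib": int(total)})
    return stats


def read_gpu_stats() -> list:
    """Query nvidia-smi once; return an empty list if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Parsing a made-up sample line, one line per GPU.
print(parse_gpu_stats("35, 1024, 8192\n"))
```

Calling read_gpu_stats() in a loop with a short sleep gives a simple textual trace of utilization and memory use.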
