Synthetic data generation by a Variational AutoEncoder with Differential Privacy assessed using Synthetic Data Vault metrics

Synthetic Data Exploration: Variational Autoencoders

NHSX Analytics Unit - PhD Internship Project

About the Project

This repository holds code for the NHSX Analytics Unit PhD internship project (previously known as Synthetic Data Generation - VAE), undertaken by Dominic Danks. The project contextualises and investigates the potential use of Variational AutoEncoders (VAEs) for synthetic health data generation.

Project Description - Synthetic Data Exploration: Variational Autoencoders

Note: No data, public or private, are shared in this repository.

Project Structure

  • The main code is found in the root of the repository (see Usage below for more information)
  • The accompanying report is also available in the reports folder
  • More information about the VAE with Differential Privacy can be found in the model card

N.B. A modified copy of Opacus (v0.14.0), a library for training PyTorch models with differential privacy, is contained within the repository. See the model card for more details.

Built With

Python v3.8

Getting Started

Installation

To get a local copy up and running, follow these steps.

To clone the repo:

git clone https://github.com/nhsx/SynthVAE.git

To create a suitable environment:

  • python -m venv synth_env
  • source synth_env/bin/activate
  • pip install -r requirements.txt

Usage

SDV Baselines

To reproduce the experiments in the report that involve the SDV baseline models (CopulaGAN, CTGAN, GaussianCopula and TVAE), run sdv_baselines.py. The available parameters are listed by the --help flag:

python sdv_baselines.py --help

usage: sdv_baselines.py [-h] [--n_runs N_RUNS] [--model_type {CopulaGAN,CTGAN,GaussianCopula,TVAE}]

optional arguments:
  -h, --help            show this help message and exit
  --n_runs N_RUNS       set number of runs/seeds
  --model_type {CopulaGAN,CTGAN,GaussianCopula,TVAE}
                        set model for baseline experiment
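
To run every baseline in turn, the invocation above can be scripted. The following is a minimal sketch and not part of the repository: the helper names baseline_command and run_all_baselines are hypothetical, and it assumes it is executed from the repository root so that sdv_baselines.py resolves. It uses only the flags shown in the help output.

```python
import subprocess
import sys

# Model names taken from the --model_type choices in the help output above.
MODEL_TYPES = ["CopulaGAN", "CTGAN", "GaussianCopula", "TVAE"]


def baseline_command(model_type, n_runs):
    """Build the argv list for one sdv_baselines.py run (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    return [
        sys.executable, "sdv_baselines.py",
        "--n_runs", str(n_runs),
        "--model_type", model_type,
    ]


def run_all_baselines(n_runs, dry_run=True):
    """Run each baseline in turn; with dry_run=True only print the commands."""
    for model_type in MODEL_TYPES:
        cmd = baseline_command(model_type, n_runs)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first baseline that exits with an error.
            subprocess.run(cmd, check=True)
```

Calling run_all_baselines(5) prints the four commands without executing them; passing dry_run=False runs them sequentially inside the activated environment.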

Scratch VAE + Differential Privacy

To reproduce the experiments in the report that involve the VAE with or without differential privacy, run scratch_vae_expts.py. The available parameters are listed by the --help flag:

python scratch_vae_expts.py --help

usage: scratch_vae_expts.py [-h] [--n_runs N_RUNS] [--diff_priv DIFF_PRIV] [--savefile SAVEFILE]

optional arguments:
  -h, --help            show this help message and exit
  --n_runs N_RUNS       set number of runs/seeds
  --diff_priv DIFF_PRIV
                        run VAE with differential privacy
  --savefile SAVEFILE   save trained model's state_dict to file
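
The usage line shows that --diff_priv takes a value rather than acting as a bare switch; the accepted values are not documented here. As an illustration only, the sketch below shows one common way to declare such an interface with argparse. The str2bool helper, the defaults and the accepted spellings are assumptions, not the repository's actual parser.

```python
import argparse


def str2bool(value):
    """Convert a command-line string to a bool (hypothetical helper)."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser():
    """Build a parser whose flags mirror the help output above."""
    parser = argparse.ArgumentParser(prog="scratch_vae_expts.py")
    # Defaults below are assumptions made for this sketch.
    parser.add_argument("--n_runs", type=int, default=1,
                        help="set number of runs/seeds")
    parser.add_argument("--diff_priv", type=str2bool, default=False,
                        help="run VAE with differential privacy")
    parser.add_argument("--savefile", type=str, default=None,
                        help="save trained model's state_dict to file")
    return parser
```

An explicit converter is used because argparse's type=bool would treat any non-empty string, including "False", as true.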

Code to load a saved model and generate correlation heatmaps is contained within plot.py. The file containing the saved model's state_dict must be provided via the --savefile command-line argument:

python plot.py --help

usage: plot.py [-h] --savefile SAVEFILE

optional arguments:
  -h, --help           show this help message and exit
  --savefile SAVEFILE  load trained model's state_dict from file

Dataset

Experiments are run against the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT) dataset, accessed via the pycox Python library.
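
For reference, pycox exposes SUPPORT through its datasets module. The sketch below wraps that call; load_support is a hypothetical helper, and it assumes pycox is installed in the environment (it may download the data on first use). How the repository's scripts preprocess the resulting DataFrame is not shown here.

```python
def load_support():
    """Return the SUPPORT dataset as a pandas DataFrame (hypothetical helper)."""
    # Imported inside the function so that defining it does not require
    # pycox; calling it does.
    from pycox.datasets import support

    return support.read_df()
```

Calling df = load_support() and inspecting df.columns lists the covariates alongside the duration and event columns used for survival analysis.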

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidance.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

To find out more about the Analytics Unit, visit our project website or get in touch at analytics-unit@nhsx.nhs.uk.