SDGym

Synthetic Data Gym: A framework to benchmark the performance of synthetic data generators for non-temporal tabular data.

Getting started

Installation

To install SDGym, fork the repository, clone your fork, and install its requirements:

git clone git@github.com:$YOUR_USERNAME/SDGym.git
cd SDGym/
pip install -r requirements.txt

After installing the requirements, build the models that are not written in Python:

sudo apt install build-essential
cd privbayes
make

Data requirements

Input Format

All the synthesizers included in SDGym take a pair of files as input:

  • An npz file containing two tables, train and test, each a numpy.ndarray. Continuous columns are stored as is, while categorical and ordinal columns are stored as integers, although the dtype will be float because a plain numpy.ndarray holds a single dtype for all columns.

  • A json file containing the metadata for the dataset, that is, information about the columns, such as the minimum and maximum values of continuous columns or the mapping from integer to string in categorical columns.

[
	{
		'name': None or str
		'type': 'Ordinal' or 'Categorical' or 'Continuous'

		# if Ordinal or Categorical
		'size': integer
		'i2s': list of str

		# if Continuous
		'min': float
		'max': float
	},
	...
]
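
As an illustration of the format described above, the following sketch writes a toy dataset to an npz/json pair. The column names, the values, and the file names toy.npz and toy.json are assumptions made for this example; they are not part of SDGym.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical toy dataset: one continuous and one categorical column.
rng = np.random.default_rng(0)
age = rng.uniform(18.0, 90.0, size=100)   # continuous, stored as is
colour = rng.integers(0, 3, size=100)     # categorical, stored as integer codes

# Stacking the columns forces a single float dtype for the whole table.
table = np.column_stack([age, colour]).astype("float64")
train, test = table[:80], table[80:]

# Metadata: one entry per column, in column order.
meta = [
    {"name": "age", "type": "Continuous",
     "min": float(age.min()), "max": float(age.max())},
    {"name": "colour", "type": "Categorical",
     "size": 3, "i2s": ["red", "green", "blue"]},
]

# File names are assumptions; a temporary folder keeps the example side-effect free.
out_dir = tempfile.mkdtemp()
npz_path = os.path.join(out_dir, "toy.npz")
json_path = os.path.join(out_dir, "toy.json")

np.savez(npz_path, train=train, test=test)
with open(json_path, "w") as f:
    json.dump(meta, f, indent=2)

# Reading the pair back.
loaded = np.load(npz_path)
with open(json_path) as f:
    loaded_meta = json.load(f)
print(loaded["train"].shape, loaded["test"].shape, loaded["train"].dtype)
```

Note that the categorical codes come back as floats (0.0, 1.0, 2.0), which matches the dtype remark above.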

Output Format

The results from SDGym are stored in the output folder with the following structure:

output
   __results__
      $MODEL.json	# Raw scores for model $MODEL
      ...

   __summaries__
      result.csv	# Table summary of the results
      barchart_$MODEL	# Bar chart for model $MODEL
      ...
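
The raw score files can be collected with a few lines of Python. This is a minimal sketch that relies only on the folder layout shown above; the helper name load_raw_scores is hypothetical, and nothing is assumed about the contents of each JSON file.

```python
import json
from pathlib import Path


def load_raw_scores(output_dir="output"):
    """Return {model_name: parsed JSON} for every file in __results__."""
    results_dir = Path(output_dir) / "__results__"
    if not results_dir.is_dir():
        return {}  # nothing has been evaluated yet

    scores = {}
    for path in sorted(results_dir.glob("*.json")):
        with path.open() as f:
            scores[path.stem] = json.load(f)  # key is the $MODEL part of the name
    return scores
```

Calling load_raw_scores() from the root of the repository returns a dictionary keyed by model name, or an empty dictionary if no results exist yet.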

Demo Datasets

SDGym includes a few datasets to use for development or demonstration purposes. These datasets have been preprocessed to be ready to use with SDGym, following the requirements specified in the Input Format section.

These datasets can be downloaded from https://s3.amazonaws.com/sdgym/SDGymBenchmarkData.zip. After downloading the archive, unzip its contents into a folder named data at the root of SDGym.

You can also execute the following commands from the root of the repository:

curl https://s3.amazonaws.com/sdgym/SDGymBenchmarkData.zip -o data.zip
mkdir data
unzip data.zip -d data/
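
Once a dataset is unpacked, its tables can be checked against the Input Format requirements. The sketch below is one possible check, not part of SDGym: the helper name check_table is hypothetical, and it assumes that categorical and ordinal codes run from 0 to size - 1 so that they index into i2s. A table would be obtained with np.load(path)["train"] and the metadata with json.load; the file names inside data are not assumed here.

```python
import numpy as np


def check_table(table, meta):
    """Return a list of problems found in `table`; empty if none were found."""
    if table.ndim != 2 or table.shape[1] != len(meta):
        return ["expected a 2-D table with %d columns, got shape %s"
                % (len(meta), table.shape)]

    problems = []
    for idx, column in enumerate(meta):
        values = table[:, idx]
        if column["type"] == "Continuous":
            if values.min() < column["min"] or values.max() > column["max"]:
                problems.append("column %d: values outside [min, max]" % idx)
        else:  # Ordinal or Categorical
            whole = np.all(values == np.round(values))
            # Assumption: valid codes are 0 .. size - 1.
            in_range = values.min() >= 0 and values.max() < column["size"]
            if not (whole and in_range):
                problems.append("column %d: invalid category codes" % idx)
            if len(column["i2s"]) != column["size"]:
                problems.append("column %d: i2s length differs from size" % idx)
    return problems


# Small in-memory demonstration with made-up data.
meta = [
    {"name": "x", "type": "Continuous", "min": 0.0, "max": 1.0},
    {"name": "c", "type": "Categorical", "size": 2, "i2s": ["no", "yes"]},
]
good = np.array([[0.1, 0.0], [0.9, 1.0]])
bad = np.array([[0.1, 0.0], [1.5, 2.0]])
print(check_table(good, meta))
print(check_table(bad, meta))
```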

Below is the list of included datasets and their original sources:

Simulated data

  • Bivariate

    • Gaussian Ring: Gaussian Mixtures arranged in a ring.

    • Gaussian Grid: Gaussian Mixtures arranged in a grid.

  • Multivariate Structured Data: Samples generated from pre-specified common causal structures.

    • Chain

    • Tree

    • Fully Connected

    • General
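
To give an idea of what the bivariate simulated data looks like, here is a minimal sketch that samples from Gaussian mixtures arranged in a ring and in a grid. It is not the code used to produce the SDGym datasets: the function names, the number of modes, the radius, the spacing, and the standard deviation are all assumptions chosen for illustration.

```python
import numpy as np


def sample_ring(n, modes=8, radius=1.0, std=0.05, seed=0):
    """Sample n points from `modes` Gaussians whose centers lie on a circle."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * np.arange(modes) / modes
    centers = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    component = rng.integers(0, modes, size=n)  # pick a mode uniformly at random
    return centers[component] + rng.normal(scale=std, size=(n, 2))


def sample_grid(n, side=5, spacing=1.0, std=0.05, seed=0):
    """Sample n points from side * side Gaussians centered on a square grid."""
    rng = np.random.default_rng(seed)
    ticks = spacing * (np.arange(side) - (side - 1) / 2)  # centered on zero
    centers = np.array([(x, y) for x in ticks for y in ticks])
    component = rng.integers(0, side * side, size=n)
    return centers[component] + rng.normal(scale=std, size=(n, 2))


ring = sample_ring(1000)
grid = sample_grid(1000)
print(ring.shape, grid.shape)
```

Both functions return an array with two continuous columns, so the result fits the Input Format described earlier once it is split into train and test tables.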

Quickstart

After installing the requirements and preparing the datasets, run the following command to evaluate a synthesizer:

python3 -m launcher SYNTHESIZER

  • SYNTHESIZER: Name of the synthesizer you want to evaluate.

    Available synthesizers: [bgmvae, bgmwgan, clbn, identity, independent, medgan, privbn, uniform, veegan]

Optional arguments:

  • --datasets: A list of datasets to evaluate the synthesizer with. If the argument is omitted or no datasets are given, all datasets are used.

    Available datasets: [asia, alarm, child, insurance, grid, gridr, ring, adult, credit, census, news, covtype, intrusion, mnist12, mnist28]

  • --force: Whether or not to overwrite existing results.

  • --repeat (int): Number of copies to generate for each dataset.
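
When several evaluations need to be scripted, the command line can be assembled from Python. The sketch below only builds the argument list; the helper name build_command is hypothetical, and it assumes that --datasets takes space-separated names and that --force is a flag without a value, neither of which is confirmed above.

```python
import sys


def build_command(synthesizer, datasets=None, force=False, repeat=None):
    """Build the argument list for `python3 -m launcher` (see assumptions above)."""
    command = [sys.executable, "-m", "launcher", synthesizer]
    if datasets:
        # Assumption: dataset names are passed space-separated.
        command += ["--datasets", *datasets]
    if force:
        # Assumption: --force is a plain flag.
        command.append("--force")
    if repeat is not None:
        command += ["--repeat", str(repeat)]
    return command


command = build_command("identity", datasets=["adult", "grid"], repeat=3)
print(" ".join(command[1:]))
# To run it from the root of the repository:
#     import subprocess; subprocess.run(command, check=True)
```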

Summary Examples