Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space [Paper link]
Python version: 3.10
Create environment
conda create -n tabsyn python=3.10
conda activate tabsyn
Install pytorch
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
Install other dependencies
pip install -r requirements.txt
Install dependencies for GOGGLE
pip install dgl -f https://data.dgl.ai/wheels/cu117/repo.html
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.1+cu117.html
Create another environment for the quality metric (package "synthcity")
conda create -n tabsyn python=3.10
conda activate tabsyn
pip install synthcity
pip install category_encoders
Download raw dataset:
python download_dataset.py
Process dataset:
python process_dataset.py
For baseline methods, use the following command for training:
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode train
Options of [NAME_OF_DATASET]: adult, default, shoppers, magic, beijing, news Options of [NAME_OF_BASELINE_METHODS]: goggle, great, stasy, codi, tabddpm
For Tabsyn, use the following command for training:
# train VAE first
python main.py --dataname [NAME_OF_DATASET] --method vae --mode train
# after the VAE is trained, train the diffusion model
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode train
For baseline methods, use the following command for synthesis:
python main.py --dataname [NAME_OF_DATASET] --method [NAME_OF_BASELINE_METHODS] --mode sample --save_path [PATH_TO_SAVE]
For Tabsyn, use the following command for synthesis:
python main.py --dataname [NAME_OF_DATASET] --method tabsyn --mode sample --save_path [PATH_TO_SAVE]
The default save path is "synthetic/[NAME_OF_DATASET]/[METHOD_NAME].csv"
Density estimation:
python eval/eval_density.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
Alpha Precision and Beta Recall:
python eval/eval_quality.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
Machine Learning Efficiency:
python eval/eval_mle.py --dataname [NAME_OF_DATASET] --model [METHOD_NAME] --path [PATH_TO_SYNTHETIC_DATA]
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.
We appreciate your citations if you find this repository useful to your research!
@article{zhang2023mixed,
title={Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space},
author={Zhang, Hengrui and Zhang, Jiani and Srinivasan, Balasubramaniam and Shen, Zhengyuan and Qin, Xiao and Faloutsos, Christos and Rangwala, Huzefa and Karypis, George},
journal={arXiv preprint arXiv:2310.09656},
year={2023}
}