Repo containing the raw code for reproducing the results from "Boosting Data Analytics with Synthetic Volume Expansion" by X. Shen, Y. Liu, and R. Shen.
This repo is undergoing structural changes for better readability, but the most relevant code for reproducing the results can be found in:
- sentiment: Sentiment analysis with Syn-Slm.
- conditional: Tabular data regression with Syn-Slm.
- tab-ddpm/synpred: Syn-Boost for predictions on benchmark datasets and simulations.
- tab-ddpm/syninf: Syn-Test for inference on real datasets and simulations.
The training code for the tabular diffusion model is mainly adapted from "TabDDPM: Modelling Tabular Data with Diffusion Models" (paper, code).
- TODO: add a Makefile for reproducing the results, with a pipeline for each example
(Tested on Ubuntu 18.04, with 4 TITAN RTX GPUs and CUDA Version 10.2.)
Create the conda environment and install the dependencies using the following commands:
# export REPO_DIR=your/path/to/the/cloned/repo/syn
export REPO_DIR=~/Documents/syn/
conda env create -f environment.yml
conda activate syn
poetry install --no-root
Both the training and testing results for fine-tuning GPT-3.5 can be found in gpt_result.csv. Note that we use GPT-3.5 as a completion model, which is essentially a conditional generator, in alignment with the central idea of Syn-Slm.
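As a reference, below is a minimal sketch of how labeled reviews could be converted into prompt/completion pairs for such a completion-style fine-tune; the file and column names (imdb_train.csv, review, sentiment) are hypothetical, and the actual preprocessing lives in the sentiment scripts.

import json
import pandas as pd

# Hypothetical preprocessing: turn labeled reviews into prompt/completion pairs
# in the JSONL format used for completion-style fine-tuning.
df = pd.read_csv("data/imdb_train.csv")  # columns "review" and "sentiment" assumed

with open("gpt_finetune_train.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {
            # The review text is the conditioning input ...
            "prompt": row["review"].strip() + "\n\nSentiment:",
            # ... and the model learns to complete it with the label,
            # i.e., it acts as a conditional generator of the response.
            "completion": " " + row["sentiment"],
        }
        f.write(json.dumps(record) + "\n")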
For other approaches, the results can be obtained via:
cd sentiment/
# download the data
gdown 'https://drive.google.com/file/d/14ixyrPbne9IfD_NCaSkXd7rarasSMgFY/view?usp=drive_link' --fuzzy -O ./data/
gdown 'https://drive.google.com/file/d/15L-hkzSNBVMnC665YXnjO52DPWUS_izi/view?usp=drive_link' --fuzzy -O ./data/
# download the DistilBERT checkpoint
gdown 'https://drive.google.com/file/d/1G8MF5l4LxgOtfXCWiiXrczekfTVlGiMC/view?usp=drive_link' --fuzzy -O ./ckpt/
# or train on four GPUs and save the checkpoint
# python imdb_distilbert.py --train
# load the saved checkpoint directly to evaluate DistilBERT
python imdb_distilbert.py --predict
# train and evaluate the performance of LSTM
python imdb_lstm.py
Compare with a traditional approach (CatBoost) under different noise levels:
cd conditional/
python ablation_sigma.py --sigma 0.1 --device "cuda:0"
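For intuition, the ablation can be pictured with the following self-contained sketch; the data-generating process and all names here are hypothetical stand-ins, and the actual experiment is defined in ablation_sigma.py.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
sigma = 0.1  # noise level, matching the --sigma flag above

# Hypothetical data-generating process: y = f(x) + sigma * eps
X = rng.uniform(-1, 1, size=(2000, 5))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + sigma * rng.standard_normal(2000)

X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

# Fit CatBoost on the noisy training data and score it on held-out data
model = CatBoostRegressor(iterations=500, verbose=False)
model.fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))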
cd tab-ddpm/
gdown 'https://drive.google.com/file/d/1j513rf5RGT4I-hnyu2aUic-s76nps8EO/view?usp=drive_link' --fuzzy
unzip data.zip
gdown 'https://drive.google.com/file/d/1SumvPWtcWbvWxtED9AzLORBGGCLz9H0a/view?usp=drive_link' --fuzzy
unzip exp.zip
rm -rf *.zip
Notebook prediction_pool_main.ipynb aggregates the Syn-Boost results on eight benchmark tabular datasets.
For this experiment, optuna is used to tune the CatBoost model trained on synthetic data. To get the result for each of the eight datasets, run (for example):
python synpred/prediction_pool_tune.py \
--dsname insurance \
--maxrho 20 \
--nratios 10 \
--ntrials 10 \
--device "cuda:0" \
A corresponding optuna study will be saved under the ratio_optuna_studies directory, containing the Syn-Boost tuning results.
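Schematically, the tuning couples an optuna study with a CatBoost model trained on synthetic data at each candidate synthetic-to-raw ratio and validated on raw data. The simplified sketch below uses hypothetical names (tune_at_ratio, sample_synthetic, raw_val); see prediction_pool_tune.py for the actual logic.

import optuna
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

def tune_at_ratio(rho, raw_val, sample_synthetic, n_trials=10):
    """Tune CatBoost on a synthetic sample of rho times the raw size (hypothetical helper)."""
    X_syn, y_syn = sample_synthetic(rho)  # synthetic training set at ratio rho
    X_val, y_val = raw_val                # held-out raw data

    def objective(trial):
        model = CatBoostRegressor(
            depth=trial.suggest_int("depth", 3, 10),
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            iterations=trial.suggest_int("iterations", 100, 1000),
            verbose=False,
        )
        model.fit(X_syn, y_syn)
        # Validate on raw data so that each ratio rho is scored fairly
        return mean_squared_error(y_val, model.predict(X_val))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_value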
Notebook evaluate.ipynb (under transfer_adult) includes:
- Evaluation results for the fine-tuned generator on the Adult-Female data, using marginal distributions, pairwise correlations, and various distributional distances.
- Syn-Boost tuning results, along with some reference metrics.
- To reproduce the results visualized in the notebook:
cd synpred/transfer_adult
# Generate dataframes for evaluation purpose
python gen_eval_sample.py
# Perform Syn-Boost tuning with fine-tuned generator
python synboost.py
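The kind of checks performed in the notebook can be sketched as follows; this is a simplified illustration rather than the notebook's code, and raw_df and syn_df are assumed to be pandas DataFrames with matching numeric columns.

import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def compare_marginals(raw_df: pd.DataFrame, syn_df: pd.DataFrame) -> pd.Series:
    """Per-column 1-D Wasserstein distance between raw and synthetic marginals."""
    return pd.Series({
        col: wasserstein_distance(raw_df[col], syn_df[col])
        for col in raw_df.columns
    })

def correlation_gap(raw_df: pd.DataFrame, syn_df: pd.DataFrame) -> float:
    """Largest absolute entry-wise gap between the two pairwise-correlation matrices."""
    return float(np.max(np.abs(raw_df.corr().values - syn_df.corr().values)))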
Notebook evaluate.ipynb (under sim_prediction) evaluates the performance of the pre-trained generator and investigates the effect of generational error on the performance of Syn-Boost. To reproduce the results:
cd synpred/sim_prediction/
# Specify pre-training size, pre-train the model and fine-tune it on the raw training data
python train.py
# Syn-Boost tuning versus CatBoost
python synboost.py
# 2-Wasserstein distance of the pre-trained/fine-tuned generators
python w2_multiprocessing.py
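For reference, the empirical 2-Wasserstein distance between two samples can be computed with the POT library as below; this is a generic sketch, and w2_multiprocessing.py parallelizes the actual computation.

import numpy as np
import ot  # POT: Python Optimal Transport

def w2_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Empirical 2-Wasserstein distance between samples X and Y."""
    a = np.full(len(X), 1.0 / len(X))  # uniform weights on the sample points
    b = np.full(len(Y), 1.0 / len(Y))
    M = ot.dist(X, Y)                  # squared Euclidean cost matrix (the default)
    return float(np.sqrt(ot.emd2(a, b, M)))  # exact OT cost, then square root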
Under the individual directories, one can find notebooks that aggregate and visualize Syn-Test results, including the ratio tuning curve and the estimated null distribution.
cd syninf/
# Tune the synthetic-to-raw ratio based on the fine-tuned generators (California data)
python syntest_california.py
cd syninf/
# Tune the synthetic-to-raw ratio based on the fine-tuned generators (Adult data)
python syntest_adult.py
cd syninf/sim_inference
# Prepare the data, pre-train/fine-tune generators and perform Syn-Test
python syntest.py
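At a high level, the Syn-Test step that estimates the null distribution amounts to Monte Carlo over synthetic samples drawn at the tuned synthetic-to-raw ratio (see the paper for the exact procedure). A schematic sketch with hypothetical names (sample_under_null, test_statistic):

import numpy as np

def estimate_null(sample_under_null, test_statistic, n_mc=500):
    """Monte-Carlo estimate of the null distribution of a test statistic.

    sample_under_null: hypothetical callable drawing one synthetic dataset
    from the generator fine-tuned under the null hypothesis.
    """
    return np.array([test_statistic(sample_under_null()) for _ in range(n_mc)])

def p_value(observed_stat, null_stats):
    # One-sided p-value: fraction of null statistics at least as extreme
    return float(np.mean(null_stats >= observed_stat))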
If you find this work useful, please cite:

@article{shen2023boosting,
title={Boosting data analytics with synthetic volume expansion},
author={Shen, Xiaotong and Liu, Yifei and Shen, Rex},
journal={arXiv preprint arXiv:2310.17848},
year={2023}
}