/TabCSDI

A code for the NeurIPS 2022 Table Representation Learning Workshop paper: "Diffusion models for missing value imputation in tabular data"

Primary LanguagePythonMIT LicenseMIT

TabCSDI: Diffusion models for missing value imputation in tabular data

This is the repo for the workshop paper: Diffusion models for missing value imputation in tabular data | OpenReview.

Setup

pip install -r requirements.txt

Running experiments

We provide 3 datasets, including Breast (original), Breast (diagnostic), and Census datasets. For census datasets, three categorical variable handling methods are provided.

Run pure numerical datasets experiments:

  • Breast (original) dataset
python exe_breast.py
  • Breast (diagnostic) dataset
python exe_breastD.py

Run mixed datatypes experiments with census dataset:

  • Using feature tokenization for categorical variables
python exe_census_ft.py
  • Using analog bits encoding for categorical variables
python exe_census_analog.py
  • Using one-hot encoding for categorical variables
python exe_census_onehot.py

Acknowledgements

The code repo is built upon the CSDI repo.

Reference

If you find our code useful or use it in your work, please cite the following paper:

@inproceedings{tashiro2021csdi,
  title={Diffusion models for missing value imputation in tabular data},
  author={Zheng, Shuhan and Charoenphakdee, Nontawat},
  booktitle={NeurIPS Table Representation Learning (TRL) Workshop},
  year={2022}
}