/awesome-diffusion-models-for-tabular-data

This is a curated list of research on diffusion models for tabular data. It also serves as the official repository for the survey paper "Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions".



Diffusion Models for Tabular Data

We explore recent advancements in diffusion models for tabular data and highlight key challenges, current progress, and future directions.

📖 You are welcome to read our paper and share your feedback!
👉 Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions (under review)

⭐ Support & Citation

If you find our survey and this repository helpful, please star this project and cite our paper:

@misc{liDiffusion2025,
  title = {Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions},
  author = {Zhong Li and Qi Huang and Lincen Yang and Jiayang Shi and Zhao Yang and Niki van Stein and Thomas Bäck and Matthijs van Leeuwen},
  year = {2025},
  month = {February},
  primaryclass = {cs},
  doi = {}
}


Timeline of GenAI for Tabular Data

[Figure: timeline of GenAI for tabular data]

Taxonomy of Diffusion Models for Tabular Data

Research on generative models for tabular data is primarily motivated by real-world applications. Based on their usage, we classify existing studies into four main categories:

  • Data Augmentation: Artificially generate new tables or entries from existing datasets.

    • Commonly used to address class imbalance in classification tasks.
    • Enhances the robustness and performance of machine learning models.
  • Data Imputation: Fill in missing or incomplete entries within existing tables.

  • Trustworthy Data Synthesis: Generate entirely new synthetic tables or entries while preserving privacy, fairness, and statistical integrity.

    • Ensures privacy protection by preventing data exposure and leakage.
    • Produces representative samples without amplifying biases in the original dataset.
  • Anomaly Detection: Identify unusual, rare, or suspicious entries that deviate significantly from normal patterns in the data.

Data Augmentation

The topic of data augmentation can be divided into two sub-topics: single table synthesis and multi-relational data synthesis.

Single Table Synthesis

Single table synthesis: generation of an entire table or of a specific part of a table (oversampling).

| Abbr. | Title | Venue & Year | Code | Domain |
|---|---|---|---|---|
| SOS | SOS: Score-based oversampling for tabular data | KDD 2022 | Stars | Generic |
| STaSy | STaSy: Score-based Tabular data Synthesis | ICLR 2023 | Stars | Generic |
| TabDDPM | TabDDPM: Modelling tabular data with diffusion models | ICML 2023 | Stars | Generic |
| CoDi | CoDi: Co-evolving contrastive diffusion models for mixed-type tabular synthesis | ICML 2023 | Stars | Generic |
| MissDiff | MissDiff: Training Diffusion Models on Tabular Data with Missing Values | ICML Workshop 2023 | N/A | Generic |
| AutoDiff | AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing | NeurIPS Workshop 2023 | Stars | Generic |
| DPM-EHR | Synthetic health-related longitudinal data with mixed-type variables generated using diffusion models | NeurIPS Workshop 2023 | Promise to release | Healthcare |
| FinDiff | FinDiff: Diffusion models for financial tabular data generation | ICAIF 2023 | Stars | Finance |
| CDTD | Continuous Diffusion for Mixed-Type Tabular Data | ICLR 2025 | Stars | Generic |
| MedDiff | MedDiff: Generating electronic health records using accelerated denoising diffusion model | ArXiv 2023 | N/A | Healthcare |
| EHR-TabDDPM | Synthesizing mixed-type electronic health records using diffusion models | ArXiv 2023 | N/A | Healthcare |
| TabSyn | Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space | ICLR 2024 | Stars | Generic |
| FlexGen-EHR | A Flexible Generative Model for Heterogeneous Tabular EHR with Missing Modality | ICLR 2024 | N/A | Healthcare |
| EHRDiff | EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models | TMLR 2024 | Stars | Healthcare |
| Forest-Diffusion | Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees | AISTATS 2024 | Stars | Generic |
| TabDiff | TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation | ICLR 2025 | Stars | Generic |
| EntTabDiff | Entity-based Financial Tabular Data Synthesis with Diffusion Models | ICAIF 2024 | N/A | Finance |
| Imb-FinDiff | Imb-FinDiff: Conditional Diffusion Models for Class Imbalance Synthesis of Financial Tabular Data | ICAIF 2024 | N/A | Finance |
| EHR-D3PM | Guided discrete diffusion for electronic health record generation | ArXiv 2024 | N/A | Healthcare |
| TabUnite | TabUnite: Efficient Encoding Schemes for Flow and Diffusion Tabular Generative Models | OpenReview 2024 | Stars | Generic |
| FraudDiffuse | FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection | ICAIF 2024 | N/A | Finance |
| FraudDDPM | Synthetic Data Generation for Fraud Detection Using Diffusion Models | ISIJ 2024 | N/A | Finance |

Multi-relational Data Synthesis

Multi-relational data synthesis: generation of multiple tables while respecting the correlations and constraints between them.

| Abbr. | Title | Venue & Year | Code | Domain |
|---|---|---|---|---|
| ClavaDDPM | ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models | NeurIPS 2024 | Stars | Generic |
| GNN-TabSyn | Relational Data Generation with Graph Neural Networks and Latent Diffusion Models | NeurIPS Workshop 2024 | Stars | Generic |

Data Imputation

Data imputation involves generating plausible values to fill in missing entries in tabular data.

| Abbr. | Title | Venue & Year | Code | Domain |
|---|---|---|---|---|
| TabCSDI | Diffusion models for missing value imputation in tabular data | NeurIPS Workshop 2022 | Stars | Generic |
| TabDiff | TabDiff: a Unified Diffusion Model for Multi-Modal Tabular Data Generation | NeurIPS Workshop 2024 | Stars | Generic |
| SimpDM | Self-supervision improves diffusion models for tabular data imputation | CIKM 2024 | Stars | Generic |
| MTabGen | Diffusion models for tabular data imputation and synthetic data generation | ArXiv 2024 | Promise to release upon acceptance | Generic |
| DDPM-Perlin | Natural generative noise diffusion model imputation | KBS 2024 | Empty repository | Generic |
| NewImp | Rethinking the diffusion models for missing data imputation: A gradient flow perspective | NeurIPS 2024 | Stars | Generic |
| DiffPuter | Unleashing the Potential of Diffusion Models for Incomplete Data Imputation | OpenReview 2024 | Stars | Generic |
| TabSyn | Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space | ICLR 2024 | Stars | Generic |
| Forest-Diffusion | Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees | AISTATS 2024 | Stars | Generic |

Trustworthy Data Synthesis

Trustworthy data synthesis aims to generate realistic surrogate values for sensitive entries while keeping the overall utility of the tabular data.

| Abbr. | Title | Venue & Year | Code | Domain |
|---|---|---|---|---|
| SiloFuse | SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models | ICDE 2024 | N/A | Generic |
| FedTabDiff | FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation | ArXiv 2024 | Stars | Generic |
| FairTabDDPM | Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models | ArXiv 2024 | Stars | Generic |
| DP-Fed-FinDiff | Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation | ArXiv 2024 | N/A | Finance |

Anomaly Detection

In anomaly detection, diffusion models learn the "normal" data distribution from a known set and flag entries in unseen data as anomalies when they deviate from this learned distribution.

| Abbr. | Title | Venue & Year | Code | Domain |
|---|---|---|---|---|
| TabADM | TabADM: Unsupervised Tabular Anomaly Detection with Diffusion Models | ArXiv 2023 | N/A | Generic |
| DTE | On Diffusion Modeling for Anomaly Detection | ICLR 2024 | Stars | Generic |
| SDAD | Self-supervised enhanced denoising diffusion for anomaly detection | Inf. Sci. 2024 | Under construction | Generic |
| NSCBAD | Anomaly Detection by Estimating Gradients of the Tabular Data Distribution | OpenReview 2024 | Supplementary Material | Generic |
| FraudDiffuse | FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection | ICAIF 2024 | N/A | Finance |
| FraudDDPM | Synthetic Data Generation for Fraud Detection Using Diffusion Models | ISIJ 2024 | N/A | Finance |

(In Depth) Handling Discrete Data in Diffusion Models

Diffusion models are primarily designed for continuous values.

Tabular data often contains discrete values and structured information (e.g., country names or product categories).

To develop robust diffusion models for tabular data, it is crucial to design techniques that intrinsically accommodate discrete data.
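As a rough illustration of one common way to accommodate a categorical column, the sketch below applies the forward (noising) step of multinomial diffusion with a uniform transition kernel, where the class probabilities of the noisy value are `alpha_bar_t * onehot(x_0) + (1 - alpha_bar_t) / K`. The linear beta schedule, the example column, and all function names are assumptions made for this example; they are not taken from the survey or from any specific implementation listed here.

```python
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_probs(category, num_classes, alpha_bar_t):
    """Class probabilities of x_t given x_0 = category under a uniform kernel."""
    uniform_mass = (1.0 - alpha_bar_t) / num_classes
    return [alpha_bar_t * (k == category) + uniform_mass for k in range(num_classes)]


def noise_column(codes, num_classes, alpha_bar_t, rng):
    """Sample a noisy category for every entry of an integer-coded column."""
    classes = list(range(num_classes))
    return [
        rng.choices(classes, weights=forward_probs(c, num_classes, alpha_bar_t))[0]
        for c in codes
    ]


# Hypothetical "country" column with three categories, stored as integer codes.
codes = [0, 2, 1, 0]
alpha_bar = alpha_bar_schedule(num_steps=1000)

probs_early = forward_probs(2, 3, alpha_bar[0])   # almost all mass stays on class 2
probs_late = forward_probs(2, 3, alpha_bar[-1])   # close to uniform over the classes

rng = random.Random(0)
noisy_codes = noise_column(codes, 3, alpha_bar[-1], rng)
```

At early steps the original category is almost always kept, while at the final step the column is close to uniform noise, which is the discrete counterpart of Gaussian noising for continuous columns.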

Although this topic is discussed in our survey paper, we refer interested readers to the repository built by the Kuleshov Group at Cornell University:

Collection of Datasets

Various datasets have been used to evaluate the performance of diffusion models for tabular data. Most datasets come from the well-established UCI machine learning repository, OpenML collection, and Kaggle platform.

Diffusion Model Usage Across Benchmarking Datasets

| Dataset Name | Appeared in |
|---|---|
| Satimage | SOS |
| Shoppers | SOS, STaSy, AutoDiff, TabSyn, TabDiff, TabUnite, DiffPuter |
| Surgical | SOS |
| Buddy | SOS, TabDDPM |
| Default | SOS, STaSy, TabDDPM, FinDiff, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter, DP-Fed-FinDiff |
| Weatheraus | SOS |
| Credit | STaSy, DDPM-Perlin |
| Htru | STaSy, AutoDiff |
| Magic | STaSy, AutoDiff, TabSyn, TabDiff, TabUnite, DiffPuter |
| Phishing | STaSy, CoDi |
| Spambase | STaSy |
| Bean | STaSy, AutoDiff, Forest-Diffusion, DiffPuter |
| Contraceptive | STaSy |
| Crowdsource | STaSy |
| Obesity | STaSy, CoDi, AutoDiff |
| Robot | STaSy |
| Shuttle | STaSy |
| Beijing | STaSy, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter |
| News | STaSy, AutoDiff, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter |
| Abalone | TabDDPM, AutoDiff, SimpDM, SiloFuse |
| Adult | TabDDPM, AutoDiff, CDTD, TabSyn, TabDiff, Imb-FinDiff, TabUnite, MTabGen, DiffPuter, SiloFuse, FairTabDDPM, DP-Fed-FinDiff |
| California Housing | TabDDPM, Forest-Diffusion, SimpDM, MTabGen, DiffPuter |
| Cardio | TabDDPM, TabUnite, MTabGen, SiloFuse |
| Churn | TabDDPM, AutoDiff, CDTD, MTabGen, SiloFuse |
| Diabetes | TabDDPM, CDTD, TabDiff, TabCSDI, SimpDM, SiloFuse, FedTabDiff |
| Facebook comm. vol. | TabDDPM |
| Gesture | TabDDPM, DiffPuter |
| Higgs small | TabDDPM |
| House 16h | TabDDPM |
| Insurance | TabDDPM, CoDi, AutoDiff, MTabGen |
| King | TabDDPM |
| MiniBooNE | TabDDPM |
| Wilt | TabDDPM, AutoDiff |
| Bank | CoDi, CDTD, TabUnite, FairTabDDPM |
| Heart | CoDi |
| Seismic | CoDi |
| Stroke | CoDi, EHR-TabDDPM, TabUnite |
| Cmc | CoDi |
| Customer | CoDi |
| Faults | CoDi, AutoDiff, TabDiff |
| Car | CoDi, Forest-Diffusion, DDPM-Perlin |
| Clave | CoDi |
| Nursery | CoDi, AutoDiff, DDPM-Perlin |
| Absent | CoDi |
| Drug | CoDi |
| Census | MissDiff, TabCSDI |
| Mimic4ed | MissDiff |
| Bayesian network (artificial) | MissDiff |
| Indian liver patient | AutoDiff, EHR-TabDDPM |
| Titanic | AutoDiff |
| ART for HIV | DPM-EHR |
| Acute Hypotension | DPM-EHR |
| Philadelphia city payments | FinDiff, Imb-FinDiff, FedTabDiff, DP-Fed-FinDiff |
| Fund holding | FinDiff |
| Acsincome | CDTD |
| Covertype | CDTD |
| Lending | CDTD |
| Nmes | CDTD |
| MIMIC-III | MedDiff, EHR-TabDDPM, FlexGen-EHR, EHRDiff, EHR-D3PM |
| Patient treatment classification | MedDiff |
| Pima indians diabetes | EHR-TabDDPM |
| eICU | FlexGen-EHR |
| CinC2012 | EHRDiff |
| PTB-ECG | EHRDiff |
| Airfoil | Forest-Diffusion, SimpDM |
| Blood | Forest-Diffusion, SimpDM, NewImp |
| Breast | Forest-Diffusion, TabCSDI, NewImp |
| Climate | Forest-Diffusion |
| Concrete compression | Forest-Diffusion, TabCSDI, SimpDM, NewImp |
| Concrete slump | Forest-Diffusion |
| Connectionist bench sonar | Forest-Diffusion |
| Connectionist bench vowel | Forest-Diffusion, NewImp |
| Ecoli | Forest-Diffusion |
| Glass | Forest-Diffusion |
| Ionosphere | Forest-Diffusion, NewImp |
| Iris | Forest-Diffusion, SimpDM |
| Libras | Forest-Diffusion, TabCSDI |
| Parkinsons | Forest-Diffusion, NewImp |
| Planning relax | Forest-Diffusion |
| Qsar biodegradation | Forest-Diffusion, NewImp |
| Seeds | Forest-Diffusion |
| Wine | Forest-Diffusion, TabCSDI, DDPM-Perlin |
| Wine quality red | Forest-Diffusion, SimpDM |
| Wine quality white | Forest-Diffusion, SimpDM, NewImp |
| Yacht | Forest-Diffusion, SimpDM |
| Yeast | Forest-Diffusion, SimpDM |
| Tic-tac-toe | Forest-Diffusion |
| Congressional voting | Forest-Diffusion |
| Brazil E-commerce | EntTabDiff |
| 13F Fund Holdings | EntTabDiff |
| Yelp reviews | EntTabDiff |
| Accounting entries | Imb-FinDiff |
| IEEE-CIS fraud detection | Imb-FinDiff, FraudDiffuse, FraudDDPM |
| Census synthetic | TabUnite |
| European credit card default | FraudDiffuse |
| Credit card fraud detection | FraudDDPM |
| Online retail | FraudDDPM |
| E-commerce transaction | FraudDDPM |
| California Multi-relational | ClavaDDPM |
| Instacart 05 | ClavaDDPM |
| Berka | ClavaDDPM |
| Movie Lens | ClavaDDPM |
| CCS | ClavaDDPM |
| AirBnB | GNN-TabSyn |
| Biodegradability | GNN-TabSyn |
| CORA | GNN-TabSyn |
| IMDB | GNN-TabSyn |
| Rossmann | GNN-TabSyn |
| Walmart | GNN-TabSyn |
| COVID-19 | TabCSDI |
| Housing | SimpDM |
| Energy | SimpDM |
| German | SimpDM |
| Phoneme | SimpDM |
| Power | SimpDM |
| Ecommerce | SimpDM |
| HELOC | MTabGen, SiloFuse |
| Gas | MTabGen |
| House Sales | MTabGen |
| Otto group | MTabGen |
| Forest Cover | MTabGen, SiloFuse |
| Bike | DDPM-Perlin |
| CPU | DDPM-Perlin |
| Frog | DDPM-Perlin |
| Satellite | DDPM-Perlin |
| Letter | DDPM-Perlin, DiffPuter |
| Turkiye | DDPM-Perlin |
| Loan | SiloFuse |
| Intrusion | SiloFuse |
| COMPAS | FairTabDDPM |
| Marketing data | DP-Fed-FinDiff |
| Adbench | TabADM, DTE, SDAD, NSCBAD |

Benchmarking Datasets for Various Diffusion Models

| Model Abbreviation | Dataset Names |
|---|---|
| SOS | Satimage, Shoppers, Surgical, Buddy, Default, Weatheraus |
| AutoDiff | Shoppers, Htru, Magic, Bean, Obesity, News, Abalone, Adult, Churn, Insurance, Wilt, Faults, Nursery, Indian liver patient, Titanic |
| TabSyn | Shoppers, Default, Magic, Beijing, News, Adult |
| TabDiff | Shoppers, Default, Magic, Beijing, News, Adult, Diabetes, Faults |
| TabUnite | Shoppers, Default, Magic, Beijing, News, Adult, Cardio, Bank, Stroke, Census synthetic |
| DiffPuter | Shoppers, Default, Magic, Bean, Beijing, News, Adult, Gesture, California Housing, Letter |
| TabDDPM | Abalone, Adult, Buddy, Default, California Housing, Cardio, Churn, Diabetes, Facebook comm. vol., Gesture, Higgs small, House 16h, Insurance, King, MiniBooNE, Wilt |
| STaSy | Default, Credit, Htru, Magic, Phishing, Spambase, Shoppers, Bean, Contraceptive, Crowdsource, Obesity, Robot, Shuttle, Beijing, News |
| FinDiff | Default, Philadelphia city payments, Fund holding |
| CDTD | Default, Beijing, News, Adult, Churn, Diabetes, Bank, Acsincome, Covertype, Lending, Nmes |
| DP-Fed-FinDiff | Default, Philadelphia city payments, Adult, Marketing data |
| DDPM-Perlin | Credit, Car, Nursery, Wine, Bike, CPU, Frog, Satellite, Letter, Turkiye |
| CoDi | Phishing, Obesity, Insurance, Bank, Heart, Seismic, Stroke, Cmc, Customer, Faults, Car, Clave, Nursery, Absent, Drug |
| Forest-Diffusion | Bean, California Housing, Car, Airfoil, Blood, Breast, Climate, Concrete compression, Concrete slump, Connectionist bench sonar, Connectionist bench vowel, Ecoli, Glass, Ionosphere, Iris, Libras, Parkinsons, Planning relax, Qsar biodegradation, Seeds, Wine, Wine quality red, Wine quality white, Yacht, Yeast, Tic-tac-toe, Congressional voting |
| SimpDM | Abalone, Diabetes, Airfoil, Blood, Concrete compression, Iris, Wine quality red, Wine quality white, Yacht, Yeast, Housing, Energy, German, Phoneme, Power, Ecommerce, California Housing |
| SiloFuse | Abalone, Adult, Cardio, Churn, Diabetes, HELOC, Loan, Forest Cover, Intrusion |
| Imb-FinDiff | Adult, Accounting entries, Philadelphia city payments, IEEE-CIS fraud detection |
| MTabGen | Adult, California Housing, Cardio, Churn, Insurance, HELOC, Gas, House Sales, Otto group, Forest Cover |
| FairTabDDPM | Adult, Bank, COMPAS |
| TabCSDI | Diabetes, Census, Breast, Concrete compression, Libras, Wine, COVID-19 |
| FedTabDiff | Diabetes, Philadelphia city payments |
| EHR-TabDDPM | Stroke, Indian liver patient, MIMIC-III, Pima indians diabetes |
| MissDiff | Census, Mimic4ed, Bayesian network (artificial) |
| DPM-EHR | ART for HIV, Acute Hypotension |
| MedDiff | MIMIC-III, Patient treatment classification |
| FlexGen-EHR | MIMIC-III, eICU |
| EHRDiff | MIMIC-III, CinC2012, PTB-ECG |
| EHR-D3PM | MIMIC-III, 2 private EHR datasets |
| NewImp | Blood, Breast, Concrete compression, Connectionist bench vowel, Ionosphere, Parkinsons, Qsar biodegradation, Wine quality white |
| EntTabDiff | Brazil E-commerce, 13F Fund Holdings, Yelp reviews |
| FraudDiffuse | IEEE-CIS fraud detection, European credit card default |
| FraudDDPM | IEEE-CIS fraud detection, Credit card fraud detection, Online retail, E-commerce transaction |
| ClavaDDPM | California Multi-relational, Instacart 05, Berka, Movie Lens, CCS |
| GNN-TabSyn | AirBnB, Biodegradability, CORA, IMDB, Rossmann, Walmart |
| TabADM | Adbench |
| DTE | Adbench |
| SDAD | Adbench |
| NSCBAD | Adbench, 15 additional datasets from icl + elki + ex-ae + odds |
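
The two tables above are two views of the same dataset-to-model mapping, so one can in principle be derived from the other. The following minimal sketch, which uses only a small excerpt of the rows and hypothetical helper names (`invert_usage`, `find_mismatches`), shows how the per-model view might be generated from the per-dataset view and how the two could be cross-checked. It is an illustration, not a script shipped with this repository.

```python
from collections import defaultdict


def invert_usage(dataset_to_models):
    """Turn {dataset: [models]} into {model: [datasets]}, keeping row order."""
    model_to_datasets = defaultdict(list)
    for dataset, models in dataset_to_models.items():
        for model in models:
            model_to_datasets[model].append(dataset)
    return dict(model_to_datasets)


def find_mismatches(dataset_to_models, model_to_datasets):
    """Return sorted (model, dataset) pairs that appear in only one of the two views."""
    pairs_a = {(m, d) for d, models in dataset_to_models.items() for m in models}
    pairs_b = {(m, d) for m, datasets in model_to_datasets.items() for d in datasets}
    return sorted(pairs_a ^ pairs_b)


# Small excerpt of the dataset-to-model rows listed above.
dataset_view = {
    "Satimage": ["SOS"],
    "Buddy": ["SOS", "TabDDPM"],
    "Wilt": ["TabDDPM", "AutoDiff"],
    "Titanic": ["AutoDiff"],
}
model_view = invert_usage(dataset_view)

# A hand-maintained model view with one entry missing gets flagged.
incomplete_view = {
    "SOS": ["Satimage"],
    "TabDDPM": ["Buddy", "Wilt"],
    "AutoDiff": ["Wilt", "Titanic"],
}
missing = find_mismatches(dataset_view, incomplete_view)
```

Here `model_view` maps `SOS` to `['Satimage', 'Buddy']`, and `missing` contains the single pair `('SOS', 'Buddy')`. Running such a check when adding a new paper would help keep both tables in sync.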

License

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material as long as you give proper credit.

Full license details: CC-BY 4.0 License