We explore recent advancements in diffusion models for tabular data and highlight key challenges, current progress, and future directions.
📖 You are welcome to read our paper and share your feedback!
👉 Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions (under review)
If you find our survey and this repository helpful, please star this project and cite our paper:
@misc{liDiffusion2025,
title = {Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions},
author = {Zhong Li and Qi Huang* and Lincen Yang* and Jiayang Shi and Zhao Yang and Niki van Stein and Thomas Bäck and Matthijs van Leeuwen},
year = {2025},
month = {February},
primaryclass = {cs},
doi = {}
}
- Awesome Diffusion Models For Tabular Data
- Table of Contents
- Timeline of GenAI for Tabular Data
- Taxonomy of Diffusion Models for Tabular Data
- Data Augmentation
- Data Imputation
- Trustworthy Data Synthesis
- Anomaly Detection
- (In Depth) Handling Discrete Data in Diffusion Models
- Collection of Datasets
Research on generative models for tabular data is primarily motivated by real-world applications. Based on their usage, we classify existing studies into four main categories:
- Data Augmentation: Artificially generate new tables or entries from existing datasets.
  - Commonly used to address class imbalance in classification tasks.
  - Enhances the robustness and performance of machine learning models.
- Data Imputation: Fill in missing or incomplete entries within existing tables.
- Trustworthy Data Synthesis: Generate entirely new synthetic tables or entries while preserving privacy, fairness, and statistical integrity.
  - Ensures privacy protection by preventing data exposure and leakage.
  - Produces representative samples without amplifying biases in the original dataset.
- Anomaly Detection: Identify unusual, rare, or suspicious entries that deviate significantly from normal patterns in the data.
Data augmentation can be divided into two sub-topics: single table synthesis and multi-relational data synthesis.
- Single table synthesis: generation of an entire table or of a specific part of a table (oversampling).
- Multi-relational data synthesis: generation of multiple tables while respecting the correlations and constraints among them.
Abbr. | Title | Venue & Year | Code | Domain |
---|---|---|---|---|
ClavaDDPM | ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models | NeurIPS 2024 | | Generic |
GNN-TabSyn | Relational Data Generation with Graph Neural Networks and Latent Diffusion Models | NeurIPS Workshop 2024 | | Generic |
Data imputation involves generating plausible values to fill in missing entries in tabular data.
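As a rough illustration of the idea, the sketch below keeps observed cells fixed and fills only the missing ones. The helper names (`impute_with_mask`, `column_mean`) are hypothetical, and the column mean is only a placeholder for the conditional sample a diffusion model would provide; it is not the method of any paper listed here.

```python
import math
import statistics

def impute_with_mask(rows, generate_value):
    """Fill only the missing (NaN) cells; observed cells are kept unchanged."""
    imputed = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if isinstance(value, float) and math.isnan(value):
                # Missing entry: ask the (stand-in) generator for a value.
                new_row.append(generate_value(rows, col))
            else:
                # Observed entry: copy it through untouched.
                new_row.append(value)
        imputed.append(new_row)
    return imputed

def column_mean(rows, col):
    """Hypothetical stand-in for a conditional generative model."""
    observed = [r[col] for r in rows if not math.isnan(r[col])]
    return statistics.fmean(observed)

table = [
    [1.0, 10.0],
    [float("nan"), 20.0],
    [3.0, float("nan")],
]
completed = impute_with_mask(table, column_mean)
```

Passing the generator as a function makes it easy to swap the placeholder for a trained model without touching the masking logic.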
Trustworthy data synthesis aims to generate realistic surrogate values for sensitive entries while keeping the overall utility of the tabular data.
Abbr. | Title | Venue & Year | Code | Domain |
---|---|---|---|---|
SiloFuse | SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models | ICDE 2024 | N/A | Generic |
FedTabDiff | FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation | ArXiv 2024 | | Generic |
FairTabDDPM | Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models | ArXiv 2024 | | Generic |
DP-Fed-FinDiff | Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation | ArXiv 2024 | N/A | Finance |
In anomaly detection, diffusion models are used to learn the “normal” distribution of data from the known set and identify anomalies as deviations from this learned distribution in the unseen data.
Abbr. | Title | Venue & Year | Code | Domain |
---|---|---|---|---|
TabADM | TabADM: Unsupervised Tabular Anomaly Detection with Diffusion Models | ArXiv 2023 | N/A | Generic |
DTE | On Diffusion Modeling for Anomaly Detection | ICLR 2024 | | Generic |
SDAD | Self-supervised enhanced denoising diffusion for anomaly detection | Inf. Sci. 2024 | Under construction | Generic |
NSCBAD | Anomaly Detection by Estimating Gradients of the Tabular Data Distribution | OpenReview 2024 | Supplementary Material | Generic |
FraudDiffuse | FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection | ICAIF 2024 | N/A | Finance |
FraudDDPM | Synthetic Data Generation for Fraud Detection Using Diffusion Models | ISIJ 2024 | N/A | Finance |
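To make the "deviation from the learned normal distribution" idea concrete, here is a minimal toy sketch. It scores a sample by how badly it is reconstructed after being noised and then denoised. It assumes 1-D Gaussian data, for which the optimal denoiser has a closed form; that closed form stands in for a trained diffusion network. The function names are hypothetical and this is not the scoring rule of any specific method in the table above.

```python
import random
import statistics

def fit_gaussian_denoiser(train, sigma):
    """Closed-form E[x | x_noisy] for 1-D Gaussian data (stand-in for a trained network)."""
    mu = statistics.fmean(train)
    var = statistics.pvariance(train)
    shrink = var / (var + sigma ** 2)  # how much of the noisy deviation is kept
    return lambda x_noisy: mu + shrink * (x_noisy - mu)

def anomaly_score(x, denoise, sigma, rng, n_draws=200):
    """Mean absolute reconstruction error after noising and denoising x."""
    errors = []
    for _ in range(n_draws):
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)
        errors.append(abs(x - denoise(x_noisy)))
    return statistics.fmean(errors)

rng = random.Random(0)
# "Normal" training data only; no anomalies are seen during fitting.
normal_train = [rng.gauss(0.0, 1.0) for _ in range(500)]
denoise = fit_gaussian_denoiser(normal_train, sigma=1.0)

score_inlier = anomaly_score(0.1, denoise, 1.0, rng)
score_outlier = anomaly_score(8.0, denoise, 1.0, rng)
```

The denoiser pulls every input toward the region where the training data lives, so a point far from that region is reconstructed poorly and receives a higher score.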
Diffusion models are primarily designed for continuous values.
Tabular data often contains discrete values and structured information (e.g., country names or product categories).
To develop robust diffusion models for tabular data, it is crucial to design techniques that intrinsically accommodate discrete data.
Although this topic is discussed in our survey paper, we also refer interested readers to the repository built by the Kuleshov Group at Cornell University.
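As one illustrative option for noising a categorical column directly, the sketch below uses a uniform-transition (multinomial) forward process: with cumulative keep-probability `alpha_bar_t`, the original category is kept, otherwise a category is drawn uniformly. The function names, the example categories, and the `alpha_bar_t` values are assumptions made for illustration; other discrete diffusion formulations exist.

```python
import random

def noised_category_probs(category_index, num_categories, alpha_bar_t):
    """q(x_t | x_0): weight alpha_bar_t on the original category, the rest spread uniformly."""
    uniform_mass = (1.0 - alpha_bar_t) / num_categories
    probs = [uniform_mass] * num_categories
    probs[category_index] += alpha_bar_t
    return probs

def sample_noised_category(category_index, num_categories, alpha_bar_t, rng):
    """Draw one noised category index from q(x_t | x_0)."""
    probs = noised_category_probs(category_index, num_categories, alpha_bar_t)
    return rng.choices(range(num_categories), weights=probs, k=1)[0]

countries = ["NL", "DE", "FR", "BE"]  # example categorical column values
rng = random.Random(0)

# Early step: the original category "NL" still dominates.
probs_early = noised_category_probs(0, len(countries), alpha_bar_t=0.9)
# Final step: all information is gone and the distribution is uniform.
probs_late = noised_category_probs(0, len(countries), alpha_bar_t=0.0)
noisy = countries[sample_noised_category(0, len(countries), 0.5, rng)]
```

Unlike adding Gaussian noise to a one-hot vector, every intermediate state here is still a valid category.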
Various datasets have been used to evaluate the performance of diffusion models for tabular data. Most datasets come from the well-established UCI machine learning repository, OpenML collection, and Kaggle platform.
Dataset Name | Appeared in |
---|---|
Satimage | SOS |
Shoppers | SOS, STaSy, AutoDiff, TabSyn, TabDiff, TabUnite, DiffPuter |
Surgical | SOS |
Buddy | SOS, TabDDPM |
Default | SOS, STaSy, TabDDPM, FinDiff, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter, DP-Fed-FinDiff |
Weatheraus | SOS |
Credit | STaSy, DDPM-Perlin |
Htru | STaSy, AutoDiff |
Magic | STaSy, AutoDiff, TabSyn, TabDiff, TabUnite, DiffPuter |
Phishing | STaSy, CoDi |
Spambase | STaSy |
Bean | STaSy, AutoDiff, Forest-Diffusion, DiffPuter |
Contraceptive | STaSy |
Crowdsource | STaSy |
Obesity | STaSy, CoDi, AutoDiff |
Robot | STaSy |
Shuttle | STaSy |
Beijing | STaSy, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter |
News | STaSy, AutoDiff, CDTD, TabSyn, TabDiff, TabUnite, DiffPuter |
Abalone | TabDDPM, AutoDiff, SimpDM, SiloFuse |
Adult | TabDDPM, AutoDiff, CDTD, TabSyn, TabDiff, Imb-FinDiff, TabUnite, MTabGen, DiffPuter, SiloFuse, FairTabDDPM, DP-Fed-FinDiff |
California Housing | TabDDPM, Forest-Diffusion, MTabGen |
Cardio | TabDDPM, TabUnite, MTabGen, SiloFuse |
Churn | TabDDPM, AutoDiff, CDTD, MTabGen, SiloFuse |
Diabetes | TabDDPM, CDTD, TabDiff, TabCSDI, SimpDM, SiloFuse, FedTabDiff |
Facebook comm. vol. | TabDDPM |
Gesture | TabDDPM, DiffPuter |
Higgs small | TabDDPM |
House 16h | TabDDPM |
Insurance | TabDDPM, CoDi, AutoDiff, MTabGen |
King | TabDDPM |
Miniboone | TabDDPM |
Wilt | TabDDPM, AutoDiff |
Bank | CoDi, CDTD, TabUnite, FairTabDDPM |
Heart | CoDi |
Seismic | CoDi |
Stroke | CoDi, EHR-TabDDPM, TabUnite |
Cmc | CoDi |
Customer | CoDi |
Faults | CoDi, AutoDiff, TabDiff |
Car | CoDi, Forest-Diffusion, DDPM-Perlin |
Clave | CoDi |
Nursery | CoDi, AutoDiff, DDPM-Perlin |
Absent | CoDi |
Drug | CoDi |
Census | MissDiff, TabCSDI |
Mimic4ed | MissDiff |
Bayesian network (artificial) | MissDiff |
Indian liver patient | AutoDiff, EHR-TabDDPM |
Titanic | AutoDiff |
ART for HIV | DPM-EHR |
Acute Hypotension | DPM-EHR |
Philadelphia city payments | FinDiff, FedTabDiff |
Fund holding | FinDiff |
Acsincome | CDTD |
Covertype | CDTD |
Lending | CDTD |
Nmes | CDTD |
MIMIC-III | MedDiff, EHR-TabDDPM, FlexGen-EHR, EHRDiff, EHR-D3PM |
Patient treatment classification | MedDiff |
Pima indians diabetes | EHR-TabDDPM |
eICU | FlexGen-EHR |
CinC2012 | EHRDiff |
PTB-ECG | EHRDiff |
Airfoil | Forest-Diffusion, SimpDM |
Blood | Forest-Diffusion, SimpDM, NewImp |
Breast | Forest-Diffusion, TabCSDI, NewImp |
Climate | Forest-Diffusion |
Concrete compression | Forest-Diffusion, TabCSDI, SimpDM, NewImp |
Concrete slump | Forest-Diffusion |
Connectionist bench sonar | Forest-Diffusion |
Connectionist bench vowel | Forest-Diffusion, NewImp |
Ecoli | Forest-Diffusion |
Glass | Forest-Diffusion |
Ionosphere | Forest-Diffusion, NewImp |
Iris | Forest-Diffusion, SimpDM |
Libras | Forest-Diffusion, TabCSDI |
Parkinsons | Forest-Diffusion, NewImp |
Planning relax | Forest-Diffusion |
Qsar biodegradation | Forest-Diffusion, NewImp |
Seeds | Forest-Diffusion |
Wine | Forest-Diffusion, TabCSDI, DDPM-Perlin |
Wine quality red | Forest-Diffusion, SimpDM |
Wine quality white | Forest-Diffusion, SimpDM, NewImp |
Yacht | Forest-Diffusion, SimpDM |
Yeast | Forest-Diffusion, SimpDM |
Tic-tac-toe | Forest-Diffusion |
Congressional voting | Forest-Diffusion |
Brazil E-commerce | EntTabDiff |
13F Fund Holdings | EntTabDiff |
Yelp reviews | EntTabDiff |
Accounting entries | Imb-FinDiff |
Philadelphia city payments | Imb-FinDiff, DP-Fed-FinDiff |
IEEE-CIS fraud detection | Imb-FinDiff, FraudDiffuse, FraudDDPM |
Census synthetic | TabUnite |
European credit card default | FraudDiffuse |
Credit card fraud detection | FraudDDPM |
Online retail | FraudDDPM |
E-commerce transaction | FraudDDPM |
California Multi-relational | ClavaDDPM |
Instacart 05 | ClavaDDPM |
Berka | ClavaDDPM |
Movie Lens | ClavaDDPM |
CCS | ClavaDDPM |
AirBnB | GNN-TabSyn |
Biodegradability | GNN-TabSyn |
CORA | GNN-TabSyn |
IMDB | GNN-TabSyn |
Rossmann | GNN-TabSyn |
Walmart | GNN-TabSyn |
COVID-19 | TabCSDI |
Housing | SimpDM |
Energy | SimpDM |
German | SimpDM |
Phoneme | SimpDM |
Power | SimpDM |
Ecommerce | SimpDM |
HELOC | MTabGen, SiloFuse |
Gas | MTabGen |
House Sales | MTabGen |
Otto group | MTabGen |
Forest Cover | MTabGen, SiloFuse |
Bike | DDPM-Perlin |
CPU | DDPM-Perlin |
Frog | DDPM-Perlin |
Satellite | DDPM-Perlin |
Letter | DDPM-Perlin, DiffPuter |
Turkiye | DDPM-Perlin |
Loan | SiloFuse |
Intrusion | SiloFuse |
COMPAS | FairTabDDPM |
Marketing data | DP-Fed-FinDiff |
Adbench | TabADM, DTE, SDAD, NSCBAD |
Model Abbreviation | Dataset Names |
---|---|
SOS | Satimage, Shoppers, Surgical, Buddy, Default, Weatheraus |
AutoDiff | Shoppers, Htru, Magic, Bean, Obesity, News, Abalone, Adult, Churn, Insurance, Wilt, Faults, Nursery, Indian liver patient, Titanic |
TabSyn | Shoppers, Default, Magic, Beijing, News, Adult |
TabDiff | Shoppers, Default, Magic, Beijing, News, Adult, Diabetes, Faults |
TabUnite | Shoppers, Default, Magic, Beijing, News, Adult, Cardio, Bank, Stroke, Census synthetic |
DiffPuter | Shoppers, Default, Magic, Bean, Beijing, News, Adult, Gesture, California Housing, Letter |
TabDDPM | Abalone, Adult, Buddy, Default, California Housing, Cardio, Churn, Diabetes, Facebook comm. vol., Gesture, Higgs small, House 16h, Insurance, King, Miniboone, Wilt |
STaSy | Default, Credit, Htru, Magic, Phishing, Spambase, Shoppers, Bean, Contraceptive, Crowdsource, Obesity, Robot, Shuttle, Beijing, News |
FinDiff | Default, Philadelphia city payments, Fund holding |
CDTD | Default, Beijing, News, Adult, Churn, Diabetes, Bank, Acsincome, Covertype, Lending, Nmes |
DP-Fed-FinDiff | Default, Philadelphia city payments, Adult, Marketing data |
DDPM-Perlin | Credit, Car, Nursery, Wine, Bike, CPU, Frog, Satellite, Letter, Turkiye |
CoDi | Phishing, Obesity, Insurance, Bank, Heart, Seismic, Stroke, Cmc, Customer, Faults, Car, Clave, Nursery, Absent, Drug |
Forest-Diffusion | Bean, California Housing, Car, Airfoil, Blood, Breast, Climate, Concrete compression, Concrete slump, Connectionist bench sonar, Connectionist bench vowel, Ecoli, Glass, Ionosphere, Iris, Libras, Parkinsons, Planning relax, Qsar biodegradation, Seeds, Wine, Wine quality red, Wine quality white, Yacht, Yeast, Tic-tac-toe, Congressional voting |
SimpDM | Abalone, Diabetes, Airfoil, Blood, Concrete compression, Iris, Wine quality red, Wine quality white, Yacht, Yeast, Housing, Energy, German, Phoneme, Power, Ecommerce, California Housing |
SiloFuse | Abalone, Adult, Cardio, Churn, Diabetes, HELOC, Loan, Forest Cover, Intrusion |
Imb-FinDiff | Adult, Accounting entries, Philadelphia city payments, IEEE-CIS fraud detection |
MTabGen | Adult, California Housing, Cardio, Churn, Insurance, HELOC, Gas, House Sales, Otto group, Forest Cover |
FairTabDDPM | Adult, Bank, COMPAS |
TabCSDI | Diabetes, Census, Breast, Concrete compression, Libras, Wine, COVID-19 |
FedTabDiff | Diabetes, Philadelphia city payments |
EHR-TabDDPM | Stroke, Indian liver patient, MIMIC-III, Pima indians diabetes |
MissDiff | Census, Mimic4ed, Bayesian network (artificial) |
DPM-EHR | ART for HIV, Acute Hypotension |
MedDiff | MIMIC-III, Patient treatment classification |
FlexGen-EHR | MIMIC-III, eICU |
EHRDiff | MIMIC-III, CinC2012, PTB-ECG |
EHR-D3PM | MIMIC-III, 2 private EHR datasets |
NewImp | Blood, Breast, Concrete compression, Connectionist bench vowel, Ionosphere, Parkinsons, Qsar biodegradation, Wine quality white |
EntTabDiff | Brazil E-commerce, 13F Fund Holdings, Yelp reviews |
FraudDiffuse | IEEE-CIS fraud detection, European credit card default |
FraudDDPM | IEEE-CIS fraud detection, Credit card fraud detection, Online retail, E-commerce transaction |
ClavaDDPM | California Multi-relational, Instacart 05, Berka, Movie Lens, CCS |
GNN-TabSyn | AirBnB, Biodegradability, CORA, IMDB, Rossmann, Walmart |
TabADM | Adbench |
DTE | Adbench |
SDAD | Adbench |
NSCBAD | Adbench, 15 additional datasets from icl + elki + ex-ae + odds |
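The two tables above are inverse views of the same mapping. For readers who want to keep them in sync, a minimal sketch of deriving the model-to-datasets view from the dataset-to-models view is shown below. The helper name `invert_index` is hypothetical, and only three rows from the first table are used as sample input.

```python
from collections import defaultdict

def invert_index(dataset_to_models):
    """Turn {dataset: [models]} into {model: [datasets]}, preserving input order."""
    model_to_datasets = defaultdict(list)
    for dataset, models in dataset_to_models.items():
        for model in models:
            model_to_datasets[model].append(dataset)
    return dict(model_to_datasets)

# Three sample rows copied from the "Dataset Name | Appeared in" table.
dataset_to_models = {
    "Satimage": ["SOS"],
    "Buddy": ["SOS", "TabDDPM"],
    "Wilt": ["TabDDPM", "AutoDiff"],
}
model_to_datasets = invert_index(dataset_to_models)
```

Maintaining only one of the two views and generating the other avoids the two tables drifting apart.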
This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material as long as you give proper credit.
Full license details: CC-BY 4.0 License