Awesome Dataset Distillation Papers

MIT

Awesome Dataset Distillation

A curated list of awesome papers on dataset distillation and related applications.

Dataset distillation is the task of synthesizing a small dataset such that models trained on it achieve high performance on the original large dataset. A dataset distillation algorithm takes as input a large real dataset to be distilled (training set), and outputs a small synthetic distilled dataset, which is evaluated by training models on this distilled dataset and testing them on a separate real dataset (validation/test set). A good small distilled dataset is not only useful for dataset understanding, but also has various applications (e.g., continual learning, privacy, neural architecture search). This task was first introduced in the 2018 paper Dataset Distillation [Tongzhou Wang et al., '18], along with a proposed algorithm using backpropagation through optimization steps. The task was first extended to real-world datasets in the paper Medical Dataset Distillation [Guang Li et al., '20], which also explored the potential of dataset distillation for privacy preservation. The paper Dataset Condensation [Bo Zhao et al., '21] first introduced gradient matching, which greatly advanced the development of the dataset distillation field.
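The input/output contract and evaluation protocol described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not any method from the papers listed here: the "real" data are two assumed Gaussian blobs, the distillation step is a naive class-mean placeholder, and the model is a nearest-centroid classifier. All function names (`make_real_dataset`, `distill_class_means`, and so on) are hypothetical.

```python
import numpy as np

def make_real_dataset(n_per_class, rng):
    """Toy stand-in for a large real dataset: two Gaussian blobs in 2-D."""
    centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
    X = np.concatenate([c + rng.normal(size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

def distill_class_means(X, y):
    """Placeholder distillation (an assumption, not a published method):
    emit one synthetic sample per class, namely the class mean."""
    classes = np.unique(y)
    X_syn = np.stack([X[y == c].mean(axis=0) for c in classes])
    return X_syn, classes

def train_nearest_centroid(X, y):
    """Fit a nearest-centroid classifier; returns (centroids, class labels)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return centroids, classes

def accuracy(model, X, y):
    """Fraction of samples whose nearest centroid carries the true label."""
    centroids, classes = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y).mean())

rng = np.random.default_rng(0)
X_train, y_train = make_real_dataset(500, rng)   # large real training set
X_test, y_test = make_real_dataset(200, rng)     # separate real test set

# Step 1: distill the real training set into a tiny synthetic set.
X_syn, y_syn = distill_class_means(X_train, y_train)

# Step 2: train one model on the distilled set, one on the full real set.
model_syn = train_nearest_centroid(X_syn, y_syn)
model_real = train_nearest_centroid(X_train, y_train)

# Step 3: evaluate both models on the held-out real test set.
acc_syn = accuracy(model_syn, X_test, y_test)
acc_real = accuracy(model_real, X_test, y_test)
print(f"distilled set size: {len(X_syn)}, real set size: {len(X_train)}")
print(f"test accuracy (distilled): {acc_syn:.3f}, (real): {acc_real:.3f}")
```

For a nearest-centroid model the class means are a lossless summary, so the two accuracies coincide here. For neural networks no such closed form exists, which is why the methods in this list optimize the synthetic samples instead (for example by backpropagating through training steps or by matching gradients).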

In recent years (2022-now), dataset distillation has gained increasing attention in the research community, across many institutes and labs. More papers are now being published each year. This research has been steadily improving dataset distillation and exploring its variants and applications.

This project is curated and maintained by Guang Li, Bo Zhao, and Tongzhou Wang.

How to submit a pull request?

🌐 Project Page
Code
📖 bibtex

Citing Awesome Dataset Distillation

If you find this project useful for your research, please use the following BibTeX entry.

@misc{li2022awesome,
  author={Li, Guang and Zhao, Bo and Wang, Tongzhou},
  title={Awesome Dataset Distillation},
  howpublished={\url{https://github.com/Guang000/Awesome-Dataset-Distillation}},
  year={2022}
}

Contents

Main
Applications

Media Coverage
Acknowledgments

Main

Dataset Distillation (Tongzhou Wang et al., 2018) 🌐 📖

Early Work

Gradient-Based Hyperparameter Optimization Through Reversible Learning (Dougal Maclaurin et al., ICML 2015) 📖

Gradient/Trajectory Matching Surrogate Objective

Dataset Condensation with Gradient Matching (Bo Zhao et al., ICLR 2021) 📖
Dataset Condensation with Differentiable Siamese Augmentation (Bo Zhao et al., ICML 2021) 📖
Dataset Distillation by Matching Training Trajectories (George Cazenavette et al., CVPR 2022) 🌐 📖
Dataset Condensation with Contrastive Signals (Saehyung Lee et al., ICML 2022) 📖
Loss-Curvature Matching for Dataset Selection and Condensation (Seungjae Shin & Heesun Bae et al., AISTATS 2023) 📖
Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation (Jiawei Du & Yidi Jiang et al., CVPR 2023) 📖
Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory (Justin Cui et al., ICML 2023) 📖
Sequential Subset Matching for Dataset Distillation (Jiawei Du et al., NeurIPS 2023) 📖
Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching (Ziyao Guo et al., ICLR 2024) 🌐 📖

Distribution/Feature Matching Surrogate Objective

CAFE: Learning to Condense Dataset by Aligning Features (Kai Wang & Bo Zhao et al., CVPR 2022) 📖
Dataset Condensation with Distribution Matching (Bo Zhao et al., WACV 2023) 📖
Improved Distribution Matching for Dataset Condensation (Ganlong Zhao et al., CVPR 2023) 📖
DataDAM: Efficient Dataset Distillation with Attention Matching (Ahmad Sajedi & Samir Khaki, ICCV 2023) 🌐 📖
M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy (Hansong Zhang & Shikun Li et al., AAAI 2024) 📖
On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm (Peng Sun et al., CVPR 2024) 📖

Better Optimization

Optimizing Millions of Hyperparameters by Implicit Differentiation (Jonathan Lorraine et al., AISTATS 2020) 📖
Dataset Meta-Learning from Kernel Ridge-Regression (Timothy Nguyen et al., ICLR 2021) 📖
Dataset Distillation with Infinitely Wide Convolutional Networks (Timothy Nguyen et al., NeurIPS 2021) 📖
On Implicit Bias in Overparameterized Bilevel Optimization (Paul Vicol et al., ICML 2022) 📖
Dataset Distillation using Neural Feature Regression (Yongchao Zhou et al., NeurIPS 2022) 🌐 📖
Efficient Dataset Distillation using Random Feature Approximation (Noel Loo et al., NeurIPS 2022) 📖
Accelerating Dataset Distillation via Model Augmentation (Lei Zhang & Jie Zhang et al., CVPR 2023) 📖
Dataset Distillation with Convexified Implicit Gradients (Noel Loo et al., ICML 2023) 📖
DREAM: Efficient Dataset Distillation by Representative Matching (Yanqing Liu & Jianyang Gu et al., ICCV 2023) 📖
On the Size and Approximation Error of Distilled Sets (Alaa Maalouf & Murad Tukan, NeurIPS 2023) 📖
Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective (Zeyuan Yin & Zhiqiang Shen et al., NeurIPS 2023) 🌐 📖
You Only Condense Once: Two Rules for Pruning Condensed Datasets (Yang He et al., NeurIPS 2023) 📖
MIM4DD: Mutual Information Maximization for Dataset Distillation (Yuzhang Shang et al., NeurIPS 2023) 📖
MGDD: A Meta Generator for Fast Dataset Distillation (Songhua Liu et al., NeurIPS 2023) 📖
Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection (Yue Xu et al., 2023) 📖
Can Pre-Trained Models Assist in Dataset Distillation? (Yao Lu et al., 2023) 📖
DREAM+: Efficient Dataset Distillation by Bidirectional Representative Matching (Yanqing Liu & Jianyang Gu et al., 2023) 📖
Dataset Distillation in Latent Space (Yuxuan Duan et al., 2023) 📖
Embarrassingly Simple Dataset Distillation (Yunzhen Feng et al., ICLR 2024) 📖
Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching (Shitong Shao et al., CVPR 2024) 📖
Group Distributionally Robust Dataset Distillation with Risk Minimization (Saeed Vahidian & Mingyu Wang & Jianyang Gu et al., 2024) 📖

Distilled Dataset Parametrization

Dataset Condensation via Efficient Synthetic-Data Parameterization (Jang-Hyun Kim et al., ICML 2022) 📖
Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks (Zhiwei Deng et al., NeurIPS 2022) 📖
On Divergence Measures for Bayesian Pseudocoresets (Balhae Kim et al., NeurIPS 2022) 📖
Dataset Distillation via Factorization (Songhua Liu et al., NeurIPS 2022) 📖
PRANC: Pseudo RAndom Networks for Compacting Deep Models (Parsa Nooralinejad et al., 2022) 📖
Dataset Condensation with Latent Space Knowledge Factorization and Sharing (Hae Beom Lee & Dong Bok Lee et al., 2022) 📖
Slimmable Dataset Condensation (Songhua Liu et al., CVPR 2023) 📖
Few-Shot Dataset Distillation via Translative Pre-Training (Songhua Liu et al., ICCV 2023) 📖
Sparse Parameterization for Epitomic Dataset Distillation (Xing Wei & Anjia Cao et al., NeurIPS 2023) 📖
Frequency Domain-based Dataset Distillation (Donghyeok Shin & Seungjae Shin et al., NeurIPS 2023) 📖

Generative Prior

Synthesizing Informative Training Samples with GAN (Bo Zhao et al., NeurIPS 2022 Workshop) 📖
Generalizing Dataset Distillation via Deep Generative Prior (George Cazenavette et al., CVPR 2023) 🌐 📖
DiM: Distilling Dataset into Generative Model (Kai Wang & Jianyang Gu et al., 2023) 📖
Dataset Condensation via Generative Model (Junhao Zhang et al., 2023) 📖
Efficient Dataset Distillation via Minimax Diffusion (Jianyang Gu et al., CVPR 2024) 📖

Label Distillation

Flexible Dataset Distillation: Learn Labels Instead of Images (Ondrej Bohdal et al., NeurIPS 2020 Workshop) 📖
Soft-Label Dataset Distillation and Text Dataset Distillation (Ilia Sucholutsky et al., IJCNN 2021) 📖

Dataset Quantization

Dataset Quantization (Daquan Zhou & Kai Wang & Jianyang Gu et al., ICCV 2023) 📖

Multimodal Distillation

Vision-Language Dataset Distillation (Xindi Wu et al., 2023) 🌐 📖

Self-Supervised Distillation

Self-Supervised Dataset Distillation for Transfer Learning (Dong Bok Lee & Seanie Lee et al., ICLR 2024) 📖

Benchmark

DC-BENCH: Dataset Condensation Benchmark (Justin Cui et al., NeurIPS 2022) 🌐 📖
A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness (Zongxiong Chen & Jiahui Geng et al., 2023) 📖

Survey

Data Distillation: A Survey (Noveen Sachdeva et al., TMLR 2023) 📖
A Survey on Dataset Distillation: Approaches, Applications and Future Directions (Jiahui Geng & Zongxiong Chen et al., IJCAI 2023) 📖
A Comprehensive Survey to Dataset Distillation (Shiye Lei et al., TPAMI 2023) 📖
Dataset Distillation: A Comprehensive Review (Ruonan Yu & Songhua Liu et al., TPAMI 2023) 📖

Ph.D. Thesis

Data-efficient Neural Network Training with Dataset Condensation (Bo Zhao, The University of Edinburgh 2023) 📖

Workshop

1st Workshop on Dataset Distillation for Computer Vision (Saeed Vahidian et al., CVPR 2024)

Applications

Continual Learning

Reducing Catastrophic Forgetting with Learning on Synthetic Data (Wojciech Masarczyk et al., CVPR 2020 Workshop) 📖
Condensed Composite Memory Continual Learning (Felix Wiewel et al., IJCNN 2021) 📖
Distilled Replay: Overcoming Forgetting through Synthetic Samples (Andrea Rosasco et al., IJCAI 2021 Workshop) 📖
Sample Condensation in Online Continual Learning (Mattia Sangermano et al., IJCNN 2022) 📖
An Efficient Dataset Condensation Plugin and Its Application to Continual Learning (Enneng Yang et al., NeurIPS 2023) 📖
Summarizing Stream Data for Memory-Restricted Online Continual Learning (Jianyang Gu et al., AAAI 2024) 📖

Privacy

SecDD: Efficient and Secure Method for Remotely Training Neural Networks (Ilia Sucholutsky et al., AAAI 2021) 📖
Privacy for Free: How does Dataset Condensation Help Privacy? (Tian Dong et al., ICML 2022) 📖
No Free Lunch in "Privacy for Free: How does Dataset Condensation Help Privacy" (Nicholas Carlini et al., 2022) 📖
Can We Achieve Robustness from Data Alone? (Nikolaos Tsilivis et al., ICML 2022 Workshop) 📖
Private Set Generation with Discriminative Information (Dingfan Chen et al., NeurIPS 2022) 📖
Towards Robust Dataset Learning (Yihan Wu et al., 2022) 📖
Backdoor Attacks Against Dataset Distillation (Yugeng Liu et al., NDSS 2023) 📖
Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation (Margarita Vinaroz et al., 2023) 📖
Understanding Reconstruction Attacks with the Neural Tangent Kernel and Dataset Distillation (Noel Loo et al., ICLR 2024) 📖
Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective (Ming-Yu Chung et al., ICLR 2024) 📖

Medical

Soft-Label Anonymous Gastric X-ray Image Distillation (Guang Li et al., ICIP 2020) 📖
Compressed Gastric Image Generation Based on Soft-Label Dataset Distillation for Medical Data Sharing (Guang Li et al., CMPB 2022) 📖
Dataset Distillation for Medical Dataset Sharing (Guang Li et al., AAAI 2023 Workshop) 📖
Communication-Efficient Federated Skin Lesion Classification with Generalizable Dataset Distillation (Yuchen Tian & Jiacheng Wang, MICCAI 2023 Workshop) 📖

Federated Learning

Federated Learning via Synthetic Data (Jack Goetz et al., 2020) 📖
Distilled One-Shot Federated Learning (Yanlin Zhou et al., 2020) 📖
DENSE: Data-Free One-Shot Federated Learning (Jie Zhang & Chen Chen et al., NeurIPS 2022) 📖
FedSynth: Gradient Compression via Synthetic Data in Federated Learning (Shengyuan Hu et al., 2022) 📖
DYNAFED: Tackling Client Data Heterogeneity with Global Dynamics (Renjie Pi et al., 2022) 📖
Meta Knowledge Condensation for Federated Learning (Ping Liu et al., ICLR 2023) 📖
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning (Yuanhao Xiong & Ruochen Wang et al., CVPR 2023) 📖
Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments (Rui Song et al., IJCNN 2023) 📖
Fed-GLOSS-DP: Federated, Global Learning using Synthetic Sets with Record Level Differential Privacy (Hui-Po Wang et al., 2023) 📖
Federated Virtual Learning on Heterogeneous Data with Local-global Distillation (Chun-Yin Huang et al., 2023) 📖

Graph Neural Network

Graph Condensation for Graph Neural Networks (Wei Jin et al., ICLR 2022) 📖
Condensing Graphs via One-Step Gradient Matching (Wei Jin et al., KDD 2022) 📖
Graph Condensation via Receptive Field Distribution Matching (Mengyang Liu et al., 2022) 📖
CaT: Balanced Continual Graph Learning with Graph Condensation (Liu Yilun et al., ICDM 2023) 📖
Structure-free Graph Condensation: From Large-scale Graphs to Condensed Graph-free Data (Xin Zheng et al., NeurIPS 2023) 📖
Does Graph Distillation See Like Vision Dataset Counterpart? (Beining Yang & Kai Wang et al., NeurIPS 2023) 📖
Fair Graph Distillation (Qizhang Feng et al., NeurIPS 2023) 📖

Neural Architecture Search

Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data (Felipe Petroski Such et al., ICML 2020) 📖
Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation (Dmitry Medvedev et al., AIST 2021) 📖

Fashion, Art, and Design

Wearable ImageNet: Synthesizing Tileable Textures via Dataset Distillation (George Cazenavette et al., CVPR 2022 Workshop) 🌐 📖
Learning from Designers: Fashion Compatibility Analysis Via Dataset Distillation (Yulan Chen et al., ICIP 2022) 📖
Galaxy Dataset Distillation with Self-Adaptive Trajectory Matching (Haowen Guan et al., NeurIPS 2023 Workshop) 📖

Knowledge Distillation

Knowledge Condensation Distillation (Chenxin Li et al., ECCV 2022) 📖

Recommender Systems

Infinite Recommendation Networks: A Data-Centric Approach (Noveen Sachdeva et al., NeurIPS 2022) 📖
Gradient Matching for Categorical Data Distillation in CTR Prediction (Chen Wang et al., RecSys 2023) 📖

Blackbox Optimization

Bidirectional Learning for Offline Infinite-width Model-based Optimization (Can Chen et al., NeurIPS 2022) 📖
Bidirectional Learning for Offline Model-based Biological Sequence Design (Can Chen et al., ICML 2023) 📖

Trustworthy

Rethinking Data Distillation: Do Not Overlook Calibration (Dongyao Zhu et al., ICCV 2023) 📖
Towards Trustworthy Dataset Distillation (Shijie Ma et al., 2023) 📖

Retrieval

Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching (Tao Feng & Jie Zhang et al., 2023) 📖

Text

Data Distillation for Text Classification (Yongqi Li et al., 2021) 📖
Dataset Distillation with Attention Labels for Fine-tuning BERT (Aru Maekawa et al., ACL 2023) 📖

Tabular

New Properties of the Data Distillation Method When Working With Tabular Data (Dmitry Medvedev et al., AIST 2020) 📖

Media Coverage

Acknowledgments

We want to thank Nikolaos Tsilivis, Wei Jin, Yongchao Zhou, Noveen Sachdeva, Can Chen, Guangxiang Zhao, Shiye Lei, Xinchao Wang, Dmitry Medvedev, Seungjae Shin, Jiawei Du, Yidi Jiang, Xindi Wu, Guangyi Liu, Yilun Liu, Kai Wang, Yue Xu, Anjia Cao, Jianyang Gu, Yuanzhen Feng, Peng Sun, Ahmad Sajedi, and Zhihao Sui for their valuable suggestions and contributions.