Awesome-Data-Centric-AI

A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we select papers that present a range of distinct ideas. We welcome contributions to further enrich and refine this list.

If you want to contribute to this list, please feel free to send a pull request. You can also contact daochen.zha@rice.edu.

Survey paper: Data-centric Artificial Intelligence: A Survey
Perspective paper (SDM 2023): Data-centric AI: Perspectives and Challenges
Towards Data Science: What Are the Data-Centric AI Concepts behind GPT Models?
Zhihu explainer (in Chinese): What Data-centric AI Techniques Are Behind the Success of GPT Models?

What is Data-centric AI?

Data-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.

Data-centric AI vs. Model-centric AI

In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.
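The data flaws named above (missing values, incorrect labels, anomalies) can be surfaced with very little code before any model is touched. The following is a minimal sketch, not a method from the survey: `audit_rows`, its parameters, and the toy rows are hypothetical, and the anomaly rule (a modified z-score based on the median absolute deviation) and the label rule (identical inputs with different labels) are simple stand-ins chosen for illustration.

```python
from statistics import median


def audit_rows(rows, numeric_field, label_field, threshold=3.5):
    """Flag row indices with missing values, numeric anomalies, or conflicting labels."""

    def input_key(row):
        # Everything except the label identifies the input.
        return tuple((k, row[k]) for k in sorted(row) if k != label_field)

    # Missing values: any field set to None.
    missing = [i for i, row in enumerate(rows) if any(v is None for v in row.values())]

    # Anomalies: modified z-score built on the median absolute deviation (MAD).
    present = [row[numeric_field] for row in rows if row[numeric_field] is not None]
    anomalies = []
    if present:
        center = median(present)
        mad = median(abs(v - center) for v in present)
        if mad > 0:
            anomalies = [
                i for i, row in enumerate(rows)
                if row[numeric_field] is not None
                and 0.6745 * abs(row[numeric_field] - center) / mad > threshold
            ]

    # Suspect labels: identical inputs that were given different labels.
    labels_by_input = {}
    for row in rows:
        labels_by_input.setdefault(input_key(row), set()).add(row[label_field])
    conflicts = [i for i, row in enumerate(rows) if len(labels_by_input[input_key(row)]) > 1]

    return {"missing": missing, "anomalies": anomalies, "conflicts": conflicts}


# Toy data (hypothetical) with one missing value, one outlier, and one label conflict.
rows = [
    {"text": "good movie", "length": 10, "label": "pos"},
    {"text": "bad movie", "length": 9, "label": "neg"},
    {"text": "good movie", "length": 10, "label": "neg"},
    {"text": "fine", "length": None, "label": "pos"},
    {"text": "okay film", "length": 11, "label": "pos"},
    {"text": "spam", "length": 5000, "label": "neg"},
]
print(audit_rows(rows, "length", "label"))
```

On these toy rows the report flags row 3 as missing a value, row 5 as an anomaly, and rows 0 and 2 as carrying conflicting labels. Real pipelines would likely replace each rule with one of the dedicated methods listed under Data Preparation and Data Quality Assurance.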

It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.

Why Data-centric AI?

We give two motivating examples to highlight the central role of data in AI.

On the left, large and high-quality training data are the driving force of recent successes of GPT models, while model architectures remain similar, except for more model weights.
On the right, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.
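To make the second example concrete, here is a minimal sketch of engineering inference data while the model stays fixed. The function `build_few_shot_prompt`, the "Review/Sentiment" template, and the example texts are all hypothetical and not taken from the survey; the resulting string would be passed unchanged to whatever language model is in use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt; only this text changes, the model does not."""
    lines = [instruction, ""]
    for text, label in examples:
        # Each demonstration shows the model the expected input/output format.
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The final item is left open for the model to complete.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great film", "positive"), ("Waste of time", "negative")],
    "Dull plot",
)
print(prompt)
```

Changing the instruction, the demonstrations, or their order changes the model's behavior without any retraining, which is the sense in which prompts count as engineered data.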

What is the Data-centric AI Framework?

The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance, where each goal is associated with several sub-goals.

The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.
The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.

Cite this Work

Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.

@article{zha2023data-centric-survey,
  title={Data-centric Artificial Intelligence: A Survey},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
  journal={arXiv preprint arXiv:2303.10158},
  year={2023}
}

Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.

@inproceedings{zha2023data-centric-perspectives,
  title={Data-centric AI: Perspectives and Challenges},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
  booktitle={SDM},
  year={2023}
}

Training Data Development
Inference Data Development
Data Maintenance
Data Benchmark

Training Data Development

Data Collection

Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [Paper] [Code]
Dataset discovery in data lakes, ICDE 2020 [Paper]
Aurum: A data discovery system, ICDE 2018 [Paper] [Code]
Table union search on open data, VLDB 2018 [Paper]
Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [Paper]
To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [Paper]
Data curation at scale: the data tamer system, CIDR 2013 [Paper]
Data integration: A theoretical perspective, PODS 2002 [Paper]

Data Labeling

Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [Paper] [Code]
Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [Paper]
Training language models to follow instructions with human feedback, NeurIPS 2022 [Paper]
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [Paper] [Code]
A survey of deep active learning, ACM Computing Surveys 2021 [Paper]
Adaptive rule discovery for labeling text data, SIGMOD 2021 [Paper]
Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [Paper]
Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [Paper] [Code]
Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [Paper] [Code]
Graph-based semi-supervised learning: A review, Neurocomputing 2020 [Paper]
Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [Paper]
Rethinking pre-training and self-training, NeurIPS 2020 [Paper]
Multi-label dataless text classification with topic modeling, KIS 2019 [Paper]
Data programming: Creating large training sets, quickly, NeurIPS 2016 [Paper]
Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [Paper]
Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [Paper]
Democratic co-learning, ICTAI 2004 [Paper]
Active learning with statistical models, JAIR 1996 [Paper]

Data Preparation

TSFEL: Time series feature extraction library, SoftwareX 2020 [Paper] [Code]
Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [Paper] [Code]
Introduction to Scikit-learn, Book 2019 [Paper] [Code]
Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [Paper]
Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [Paper]
Time series classification from scratch with deep neural networks: A strong baseline, IJCNN 2017 [Paper]
Missing data imputation: focusing on single imputation, ATM 2016 [Paper]
Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [Paper]
Data normalization and standardization: a technical report, MLTR 2014 [Paper]
CrowdER: crowdsourcing entity resolution, VLDB 2012 [Paper]
Imputation of Missing Data Using Machine Learning Techniques, KDD 1996 [Paper]

Data Reduction

Active feature selection for the mutual information criterion, AAAI 2021 [Paper] [Code]
Active incremental feature selection using a fuzzy-rough-set-based information entropy, IEEE Transactions on Fuzzy Systems 2020 [Paper]
MESA: boost ensemble imbalanced learning with meta-sampler, NeurIPS 2020 [Paper] [Code]
Autoencoders, arXiv 2020 [Paper]
Feature selection: A data perspective, ACM Computing Surveys 2017 [Paper] [Code]
Intrusion detection model using fusion of chi-square feature selection and multi class SVM, Journal of King Saud University-Computer and Information Sciences 2017 [Paper]
Feature selection and analysis on correlated gas sensor data with recursive feature elimination, Sensors and Actuators B: Chemical 2015 [Paper]
Embedded unsupervised feature selection, AAAI 2015 [Paper]
Using random undersampling to alleviate class imbalance on tweet sentiment data, ICIRI 2015 [Paper]
Feature selection based on information gain, IJITEE 2013 [Paper]
Linear discriminant analysis, Book 2013 [Paper]
Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction, 2012 [Paper]
Principal component analysis, Wiley Interdisciplinary Reviews 2010 [Paper] [Code]
Finding representative patterns with ordered projections, Pattern Recognition 2003 [Paper]

Data Augmentation

Towards automated imbalanced learning with deep hierarchical reinforcement learning, CIKM 2022 [Paper] [Code]
G-Mixup: Graph Data Augmentation for Graph Classification, ICML 2022 [Paper] [Code]
Cascaded Diffusion Models for High Fidelity Image Generation, JMLR 2022 [Paper]
Time series data augmentation for deep learning: A survey, IJCAI 2021 [Paper]
Text data augmentation for deep learning, JBD 2020 [Paper]
Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, ACL 2020 [Paper] [Code]
Autoaugment: Learning augmentation policies from data, CVPR 2019 [Paper] [Code]
Mixup: Beyond empirical risk minimization, ICLR 2018 [Paper] [Code]
Synthetic data augmentation using GAN for improved liver lesion classification, ISBI 2018 [Paper] [Code]
Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, ASRU 2017 [Paper]
Character-level convolutional networks for text classification, NeurIPS 2015 [Paper] [Code]
ADASYN: Adaptive synthetic sampling approach for imbalanced learning, IJCNN 2008 [Paper] [Code]
SMOTE: synthetic minority over-sampling technique, JAIR 2002 [Paper] [Code]

Pipeline Search

Towards Personalized Preprocessing Pipeline Search, arXiv 2023 [Paper]
AutoVideo: An Automated Video Action Recognition System, IJCAI 2022 [Paper] [Code]
Tods: An automated time series outlier detection system, AAAI 2021 [Paper] [Code]
Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering, KDD 2020 [Paper]
On evaluation of automl systems, ICML 2020 [Paper]
AlphaD3M: Machine learning pipeline synthesis, ICML 2018 [Paper]
Efficient and robust automated machine learning, NeurIPS 2015 [Paper] [Code]

Inference Data Development

In-distribution Evaluation

FOCUS: Flexible optimizable counterfactual explanations for tree ensembles, AAAI 2022 [Paper] [Code]
Sliceline: Fast, linear-algebra-based slice finding for ml model debugging, SIGMOD 2021 [Paper] [Code]
Counterfactual explanations for oblique decision trees: Exact, efficient algorithms, AAAI 2021 [Paper]
A Step Towards Global Counterfactual Explanations: Approximating the Feature Space Through Hierarchical Division and Graph Search, AAIML 2021 [Paper]
An exact counterfactual-example-based approach to tree-ensemble models interpretability, arXiv 2021 [Paper] [Code]
No subclass left behind: Fine-grained robustness in coarse-grained classification problems, NeurIPS 2020 [Paper] [Code]
FACE: feasible and actionable counterfactual explanations, AIES 2020 [Paper] [Code]
DACE: Distribution-Aware Counterfactual Explanation by Mixed-Integer Linear Optimization, IJCAI 2020 [Paper]
Multi-objective counterfactual explanations, arXiv 2020 [Paper] [Code]
Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models, AIES 2020 [Paper] [Code]
Propublica's compas data revisited, arXiv 2019 [Paper]
Slice finder: Automated data slicing for model validation, ICDE 2019 [Paper] [Code]
Multiaccuracy: Black-box post-processing for fairness in classification, AIES 2019 [Paper] [Code]
Model agnostic contrastive explanations for structured data, arXiv 2019 [Paper]
Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law & Technology 2018 [Paper]
Comparison-based inverse classification for interpretability in machine learning, IPMU 2018 [Paper]
Quantitative program slicing: Separating statements by relevance, ICSE 2013 [Paper]
Stratal slicing, Part II: Real 3-D seismic data, Geophysics 1998 [Paper]

Out-of-distribution Evaluation

A brief review of domain adaptation, Transactions on Computational Science and Computational Intelligence 2021 [Paper]
Domain adaptation for medical image analysis: a survey, IEEE Transactions on Biomedical Engineering 2021 [Paper]
Retiring adult: New datasets for fair machine learning, NeurIPS 2021 [Paper] [Code]
Wilds: A benchmark of in-the-wild distribution shifts, ICML 2021 [Paper] [Code]
Do image classifiers generalize across time?, ICCV 2021 [Paper]
Using videos to evaluate image model robustness, arXiv 2019 [Paper]
Regularized learning for domain adaptation under label shifts, ICLR 2019 [Paper] [Code]
Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code]
Towards deep learning models resistant to adversarial attacks, ICLR 2018 [Paper] [Code]
Robust physical-world attacks on deep learning visual classification, CVPR 2018 [Paper]
Detecting and correcting for label shift with black box predictors, ICML 2018 [Paper]
Poison frogs! targeted clean-label poisoning attacks on neural networks, NeurIPS 2018 [Paper] [Code]
Practical black-box attacks against machine learning, CCS 2017 [Paper]
Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, AISec 2017 [Paper] [Code]
Deepfool: a simple and accurate method to fool deep neural networks, CVPR 2016 [Paper] [Code]
Evasion attacks against machine learning at test time, ECML PKDD 2013 [Paper] [Code]
Adapting visual category models to new domains, ECCV 2010 [Paper]
Covariate shift by kernel mean matching, MIT Press 2009 [Paper]
Covariate shift adaptation by importance weighted cross validation, JMLR 2007 [Paper]

Prompt Engineering

SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization, arXiv 2023 [Paper]
Making Pre-trained Language Models Better Few-shot Learners, arXiv 2021 [Paper] [Code]
Bartscore: Evaluating generated text as text generation, NeurIPS 2021 [Paper] [Code]
BERTese: Learning to Speak to BERT, arXiv 2021 [Paper]
Few-shot text generation with pattern-exploiting training, arXiv 2020 [Paper]
Exploiting cloze questions for few shot text classification and natural language inference, arXiv 2020 [Paper] [Code]
It's not just size that matters: Small language models are also few-shot learners, arXiv 2020 [Paper]
How can we know what language models know?, TACL 2020 [Paper] [Code]
Universal adversarial triggers for attacking and analyzing NLP, EMNLP 2019 [Paper] [Code]

Data Maintenance

Data Understanding

The science of visual data communication: What works, Psychological Science in the Public Interest 2021 [Paper]
Towards out-of-distribution generalization: A survey, arXiv 2021 [Paper]
Snowy: Recommending utterances for conversational visual analysis, UIST 2021 [Paper]
A distributional framework for data valuation, ICML 2020 [Paper]
A comparison of radial and linear charts for visualizing daily patterns, TVCG 2020 [Paper]
A marketplace for data: An algorithmic solution, EC 2019 [Paper]
Data shapley: Equitable valuation of data for machine learning, PMLR 2019 [Paper] [Code]
Deepeye: Towards automatic data visualization, ICDE 2018 [Paper] [Code]
Voyager: Exploratory analysis via faceted browsing of visualization recommendations, TVCG 2016 [Paper]
A survey of clustering algorithms for big data: Taxonomy and empirical analysis, TETC 2014 [Paper]
On the benefits and drawbacks of radial diagrams, Handbook of Human Centric Visualization 2013 [Paper]
What makes a visualization memorable?, TVCG 2013 [Paper]
Toward a taxonomy of visuals in science communication, Technical Communication 2011 [Paper]

Data Quality Assurance

Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving, CIKM Workshop 2022 [Paper]
A Human-ML Collaboration Framework for Improving Video Content Reviews, arXiv 2022 [Paper]
Knowledge graph quality management: a comprehensive survey, TKDE 2022 [Paper]
A crowdsourcing open platform for literature curation in UniProt, PLoS Biol. 2021 [Paper] [Code]
Building data curation processes with crowd intelligence, Advanced Information Systems Engineering 2020 [Paper]
Data Curation with Deep Learning, EDBT 2020 [Paper]
Automating large-scale data quality verification, VLDB 2018 [Paper]
Data quality: The role of empiricism, SIGMOD 2017 [Paper]
Tfx: A tensorflow-based production-scale machine learning platform, KDD 2017 [Paper] [Code]
Discovering denial constraints, VLDB 2013 [Paper] [Code]
Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
Conditional functional dependencies for data cleaning, ICDE 2007 [Paper]
Data quality assessment, Communications of the ACM 2002 [Paper]

Data Storage and Retrieval

Dbmind: A self-driving platform in opengauss, PVLDB 2021 [Paper]
Online index selection using deep reinforcement learning for a cluster database, ICDEW 2020 [Paper]
Bridging the semantic gap with SQL query logs in natural language interfaces to databases, ICDE 2019 [Paper]
An end-to-end learning-based cost estimator, VLDB 2019 [Paper] [Code]
An adaptive approach for index tuning with learning classifier systems on hybrid storage environments, Hybrid Artificial Intelligent Systems 2018 [Paper]
Automatic database management system tuning through large-scale machine learning, SIGMOD 2017 [Paper]
Learning to rewrite queries, CIKM 2016 [Paper]
DBridge: A program rewrite tool for set-oriented query execution, IEEE ICDE 2011 [Paper]
Starfish: A Self-tuning System for Big Data Analytics, CIDR 2011 [Paper] [Code]
DB2 advisor: An optimizer smart enough to recommend its own indexes, ICDE 2000 [Paper]
An efficient, cost-driven index selection tool for Microsoft SQL server, VLDB 1997 [Paper]

Data Benchmark

Training Data Development Benchmark

REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines, PVLDB 2023 [Paper] [Code]
Usb: A unified semi-supervised learning benchmark for classification, NeurIPS 2022 [Paper] [Code]
A feature extraction & selection benchmark for structural health monitoring, Structural Health Monitoring 2022 [Paper]
Data augmentation for deep graph learning: A survey, KDD 2022 [Paper]
Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods, Briefings in Bioinformatics 2022 [Paper] [Code]
Amlb: an automl benchmark, arXiv 2022 [Paper]
A benchmark for data imputation methods, Front. Big Data 2021 [Paper] [Code]
Benchmark and survey of automated machine learning frameworks, JAIR 2021 [Paper]
Benchmarking differentially private synthetic data generation algorithms, arXiv 2021 [Paper]
A comprehensive benchmark framework for active learning methods in entity matching, SIGMOD 2020 [Paper]
Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy, CVPR 2020 [Paper] [Code]
Comparison of instance selection and construction methods with various classifiers, Applied Sciences 2020 [Paper]
An empirical survey of data augmentation for time series classification with neural networks, arXiv 2020 [Paper] [Code]
Toward a quantitative survey of dimension reduction techniques, IEEE Transactions on Visualization and Computer Graphics 2019 [Paper] [Code]
Cleanml: A benchmark for joint data cleaning and machine learning experiments and analysis, arXiv 2019 [Paper] [Code]
Comparison of different image data augmentation approaches, Journal of Big Data 2019 [Paper] [Code]
A benchmark and comparison of active learning for logistic regression, Pattern Recognition 2018 [Paper]
A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge, Database (Oxford). 2017 [Paper] [Data]
RODI: A benchmark for automatic mapping generation in relational-to-ontology data integration, ESWC 2015 [Paper] [Code]
TPC-DI: the first industry benchmark for data integration, PVLDB 2014 [Paper]
Comparison of instance selection algorithms II. Results and comments, ICAISC 2004 [Paper]

Inference Data Development Benchmark

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv 2023 [Paper] [Code]
Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms, arXiv 2021 [Paper] [Code]
Benchmarking adversarial robustness on image classification, CVPR 2020 [Paper]
Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples, ACL Workshop 2020 [Code]
Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code] [Code]

Data Maintenance Benchmark

Chart-to-text: A large-scale benchmark for chart summarization, ACL 2022 [Paper] [Code]
Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification?, CVPR 2021 [Paper] [Code]
An evaluation-focused framework for visualization recommendation algorithms, IEEE Transactions on Visualization and Computer Graphics 2021 [Paper] [Code]
Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation, VLDB 2021 [Paper] [Code]
Benchmarking Data Curation Systems, IEEE Data Eng. Bull. 2016 [Paper]
Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
Benchmark development for the evaluation of visualization for data mining, Information visualization in data mining and knowledge discovery 2001 [Paper]

Unified Benchmark

Dataperf: Benchmarks for data-centric AI development, arXiv 2022 [Paper]
