Awesome-Data-Centric-AI

A curated, but incomplete, list of data-centric AI resources. Covering every paper is infeasible, so we selectively choose papers that present a range of distinct ideas. We welcome contributions to further enrich and refine this list.

📢 News: Please check out our open-sourced Large Time Series Model (LTSM)!

To contribute to this list, please send a pull request. You can also contact daochen.zha@rice.edu.

Want to discuss with others who are also interested in data-centric AI? There are three options:

  • Join our Slack channel
  • Join our QQ group (183116457). Password: datacentric
  • Join the WeChat group below. If the QR code has expired, please add WeChat ID zdcwhu with a note indicating that you want to join the Data-centric AI group.

[Image: WeChat group QR code]

What is Data-centric AI?

Data-centric AI is an emerging field that focuses on engineering data to improve AI systems with enhanced data quality and quantity.

Data-centric AI vs. Model-centric AI

[Figure: data-centric AI vs. model-centric AI]

In the conventional model-centric AI lifecycle, researchers and developers primarily focus on identifying more effective models to improve AI performance while keeping the data largely unchanged. However, this model-centric paradigm overlooks the potential quality issues and undesirable flaws of data, such as missing values, incorrect labels, and anomalies. Complementing the existing efforts in model advancement, data-centric AI emphasizes the systematic engineering of data to build AI systems, shifting our focus from model to data.

It is important to note that "data-centric" differs fundamentally from "data-driven", as the latter only emphasizes the use of data to guide AI development, which typically still centers on developing models rather than engineering data.
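As a rough illustration of what "engineering data" can mean in practice, the sketch below handles two of the data flaws named above, missing values and anomalies, before any model is involved. The function names, the median imputation, and the modified z-score rule (based on the median absolute deviation) are illustrative assumptions chosen for brevity, not methods prescribed by the survey.

```python
from statistics import median

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_anomalies(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    The modified z-score uses the median absolute deviation (MAD),
    which is less sensitive to outliers than the standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All points sit on the median; nothing can be flagged.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]

# Hypothetical raw feature column with one missing value and one anomaly.
raw = [1.0, 1.2, None, 0.9, 1.1, 25.0, 1.0]

filled = impute_missing(raw)
flags = flag_anomalies(filled)
clean = [v for v, is_anomaly in zip(filled, flags) if not is_anomaly]

print(clean)
```

In a data-centric workflow, the downstream model would stay fixed while steps like these are iterated on and evaluated.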

Why Data-centric AI?

[Figure: motivating examples from GPT models (left: training data; right: inference data)]

Two motivating examples of GPT models highlight the central role of data in AI.

  • On the left, large and high-quality training data are the driving force behind the recent successes of GPT models, while the model architectures remain similar apart from a larger number of weights.
  • On the right, when the model becomes sufficiently powerful, we only need to engineer prompts (inference data) to accomplish our objectives, with the model being fixed.

Another example is Segment Anything, a foundation model for computer vision. The core of training Segment Anything is its large annotated dataset, which contains more than 1 billion masks, 400 times more than existing segmentation datasets.

What is the Data-centric AI Framework?

[Figure: the data-centric AI framework]

The data-centric AI framework consists of three goals: training data development, inference data development, and data maintenance. Each goal is associated with several sub-goals.

  • The goal of training data development is to collect and produce rich and high-quality training data to support the training of machine learning models.
  • The objective of inference data development is to create novel evaluation sets that can provide more granular insights into the model or trigger a specific capability of the model with engineered data inputs.
  • The purpose of data maintenance is to ensure the quality and reliability of data in a dynamic environment.
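To make the inference data development goal more concrete, here is a minimal sketch of one way to build an "engineered data input": a few-shot prompt assembled from an instruction, labeled examples, and a new query. The function name, the `Input:`/`Label:` template, and the sentiment examples are hypothetical choices for illustration. No model is called; the point is only that the input changes while the model stays fixed.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, labeled examples, and a new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Label:")
    return "\n".join(lines)

# Hypothetical sentiment-classification examples.
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [
        ("The battery lasts all day.", "positive"),
        ("The screen cracked in a week.", "negative"),
    ],
    "Setup took five minutes and everything worked.",
)

print(prompt)
```

Changing the instruction, the examples, or their order yields a different inference input for the same model, which is the kind of engineering the papers under Prompt Engineering study.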

Cite this Work

Zha, Daochen, et al. "Data-centric Artificial Intelligence: A Survey." arXiv preprint arXiv:2303.10158, 2023.

@article{zha2023data-centric-survey,
  title={Data-centric Artificial Intelligence: A Survey},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Jiang, Zhimeng and Zhong, Shaochen and Hu, Xia},
  journal={arXiv preprint arXiv:2303.10158},
  year={2023}
}

Zha, Daochen, et al. "Data-centric AI: Perspectives and Challenges." SDM, 2023.

@inproceedings{zha2023data-centric-perspectives,
  title={Data-centric AI: Perspectives and Challenges},
  author={Zha, Daochen and Bhat, Zaid Pervaiz and Lai, Kwei-Herng and Yang, Fan and Hu, Xia},
  booktitle={SDM},
  year={2023}
}

Table of Contents

Training Data Development

Data Collection

  • Revisiting time series outlier detection: Definitions and benchmarks, NeurIPS 2021 [Paper] [Code]
  • Dataset discovery in data lakes, ICDE 2020 [Paper]
  • Aurum: A data discovery system, ICDE 2018 [Paper] [Code]
  • Table union search on open data, VLDB 2018 [Paper]
  • Data Integration: The Current Status and the Way Forward, IEEE Computer Society Technical Committee on Data Engineering 2018 [Paper]
  • To join or not to join? thinking twice about joins before feature selection, SIGMOD 2016 [Paper]
  • Data curation at scale: the data tamer system, CIDR 2013 [Paper]
  • Data integration: A theoretical perspective, PODS 2002 [Paper]

Data Labeling

  • Segment Anything [Paper] [Code]
  • Active Ensemble Learning for Knowledge Graph Error Detection, WSDM 2023 [Paper]
  • Active-Learning-as-a-Service: An Efficient MLOps System for Data-Centric AI, NeurIPS 2022 Workshop on Human in the Loop Learning [Paper] [Code]
  • Training language models to follow instructions with human feedback, NeurIPS 2022 [Paper]
  • Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling, ICLR 2021 [Paper] [Code]
  • A survey of deep active learning, ACM Computing Surveys 2021 [Paper]
  • Adaptive rule discovery for labeling text data, SIGMOD 2021 [Paper]
  • Cut out the annotator, keep the cutout: better segmentation with weak supervision, ICLR 2021 [Paper]
  • Meta-AAD: Active anomaly detection with deep reinforcement learning, ICDM 2020 [Paper] [Code]
  • Snorkel: Rapid training data creation with weak supervision, VLDB 2020 [Paper] [Code]
  • Graph-based semi-supervised learning: A review, Neurocomputing 2020 [Paper]
  • Annotator rationales for labeling tasks in crowdsourcing, JAIR 2020 [Paper]
  • Rethinking pre-training and self-training, NeurIPS 2020 [Paper]
  • Multi-label dataless text classification with topic modeling, KIS 2019 [Paper]
  • Data programming: Creating large training sets, quickly, NeurIPS 2016 [Paper]
  • Semi-supervised consensus labeling for crowdsourcing, SIGIR 2011 [Paper]
  • Vox Populi: Collecting High-Quality Labels from a Crowd, COLT 2009 [Paper]
  • Democratic co-learning, ICTAI 2004 [Paper]
  • Active learning with statistical models, JAIR 1996 [Paper]

Data Preparation

  • DataFix: Adversarial Learning for Feature Shift Detection and Correction, NeurIPS 2023 [Paper] [Code]
  • OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
  • TSFEL: Time series feature extraction library, SoftwareX 2020 [Paper] [Code]
  • Alphaclean: Automatic generation of data cleaning pipelines, arXiv 2019 [Paper] [Code]
  • Introduction to Scikit-learn, Book 2019 [Paper] [Code]
  • Feature extraction: a survey of the types, techniques, applications, ICSC 2019 [Paper]
  • Feature engineering for predictive modeling using reinforcement learning, AAAI 2018 [Paper]
  • Time series classification from scratch with deep neural networks: A strong baseline, IJCNN 2017 [Paper]
  • Missing data imputation: focusing on single imputation, ATM 2016 [Paper]
  • Estimating the number and sizes of fuzzy-duplicate clusters, CIKM 2014 [Paper]
  • Data normalization and standardization: a technical report, MLTR 2014 [Paper]
  • CrowdER: crowdsourcing entity resolution, VLDB 2012 [Paper]
  • Imputation of Missing Data Using Machine Learning Techniques, KDD 1996 [Paper]

Data Reduction

  • Active feature selection for the mutual information criterion, AAAI 2021 [Paper] [Code]
  • Active incremental feature selection using a fuzzy-rough-set-based information entropy, IEEE Transactions on Fuzzy Systems 2020 [Paper]
  • MESA: boost ensemble imbalanced learning with meta-sampler, NeurIPS 2020 [Paper] [Code]
  • Autoencoders, arXiv 2020 [Paper]
  • Feature selection: A data perspective, ACM Computing Surveys 2017 [Paper] [Code]
  • Intrusion detection model using fusion of chi-square feature selection and multi class SVM, Journal of King Saud University-Computer and Information Sciences 2017 [Paper]
  • Feature selection and analysis on correlated gas sensor data with recursive feature elimination, Sensors and Actuators B: Chemical 2015 [Paper]
  • Embedded unsupervised feature selection, AAAI 2015 [Paper]
  • Using random undersampling to alleviate class imbalance on tweet sentiment data, ICIRI 2015 [Paper]
  • Feature selection based on information gain, IJITEE 2013 [Paper]
  • Linear discriminant analysis, Book 2013 [Paper]
  • Introduction to k nearest neighbour classification and condensed nearest neighbour data reduction, 2012 [Paper]
  • Principal component analysis, Wiley Interdisciplinary Reviews 2010 [Paper] [Code]
  • Finding representative patterns with ordered projections, Pattern Recognition 2003 [Paper]

Data Augmentation

  • Towards automated imbalanced learning with deep hierarchical reinforcement learning, CIKM 2022 [Paper] [Code]
  • G-Mixup: Graph Data Augmentation for Graph Classification, ICML 2022 [Paper] [Code]
  • Cascaded Diffusion Models for High Fidelity Image Generation, JMLR 2022 [Paper]
  • Time series data augmentation for deep learning: A survey, IJCAI 2021 [Paper]
  • Text data augmentation for deep learning, JBD 2020 [Paper]
  • Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, ACL 2020 [Paper] [Code]
  • Autoaugment: Learning augmentation policies from data, CVPR 2019 [Paper] [Code]
  • Mixup: Beyond empirical risk minimization, ICLR 2018 [Paper] [Code]
  • Synthetic data augmentation using GAN for improved liver lesion classification, ISBI 2018 [Paper] [Code]
  • Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation, ASRU 2017 [Paper]
  • Character-level convolutional networks for text classification, NeurIPS 2015 [Paper] [Code]
  • ADASYN: Adaptive synthetic sampling approach for imbalanced learning, IJCNN 2008 [Paper] [Code]
  • SMOTE: synthetic minority over-sampling technique, JAIR 2002 [Paper] [Code]

Pipeline Search

  • Towards Personalized Preprocessing Pipeline Search, arXiv 2023 [Paper]
  • AutoVideo: An Automated Video Action Recognition System, IJCAI 2022 [Paper] [Code]
  • Tods: An automated time series outlier detection system, AAAI 2021 [Paper] [Code]
  • Deepline: Automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering, KDD 2020 [Paper]
  • On evaluation of automl systems, ICML 2020 [Paper]
  • AlphaD3M: Machine learning pipeline synthesis, ICML 2018 [Paper]
  • Efficient and robust automated machine learning, NeurIPS 2015 [Paper] [Code]
  • Tiny3D: A Data-Centric AI based 3D Object Detection Service Production System [Code]
  • Learning From How Humans Correct [Paper]
  • The Re-Label Method For Data-Centric Machine Learning [Paper]
  • Automatic Label Error Correction [Paper] [Code]

Inference Data Development

In-distribution Evaluation

  • FOCUS: Flexible optimizable counterfactual explanations for tree ensembles, AAAI 2022 [Paper] [Code]
  • Sliceline: Fast, linear-algebra-based slice finding for ml model debugging, SIGMOD 2021 [Paper] [Code]
  • Counterfactual explanations for oblique decision trees: Exact, efficient algorithms, AAAI 2021 [Paper]
  • A Step Towards Global Counterfactual Explanations: Approximating the Feature Space Through Hierarchical Division and Graph Search, AAIML 2021 [Paper]
  • An exact counterfactual-example-based approach to tree-ensemble models interpretability, arXiv 2021 [Paper] [Code]
  • No subclass left behind: Fine-grained robustness in coarse-grained classification problems, NeurIPS 2020 [Paper] [Code]
  • FACE: feasible and actionable counterfactual explanations, AIES 2020 [Paper] [Code]
  • DACE: Distribution-Aware Counterfactual Explanation by Mixed-Integer Linear Optimization, IJCAI 2020 [Paper]
  • Multi-objective counterfactual explanations, arXiv 2020 [Paper] [Code]
  • Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models, AIES 2020 [Paper] [Code]
  • Propublica's compas data revisited, arXiv 2019 [Paper]
  • Slice finder: Automated data slicing for model validation, ICDE 2019 [Paper] [Code]
  • Multiaccuracy: Black-box post-processing for fairness in classification, AIES 2019 [Paper] [Code]
  • Model agnostic contrastive explanations for structured data, arXiv 2019 [Paper]
  • Counterfactual explanations without opening the black box: Automated decisions and the GDPR, Harvard Journal of Law & Technology 2018 [Paper]
  • Comparison-based inverse classification for interpretability in machine learning, IPMU 2018 [Paper]
  • Quantitative program slicing: Separating statements by relevance, ICSE 2013 [Paper]
  • Stratal slicing, Part II: Real 3-D seismic data, Geophysics 1998 [Paper]

Out-of-distribution Evaluation

  • A brief review of domain adaptation, Transactions on Computational Science and Computational Intelligence 2021 [Paper]
  • Domain adaptation for medical image analysis: a survey, IEEE Transactions on Biomedical Engineering 2021 [Paper]
  • Retiring adult: New datasets for fair machine learning, NeurIPS 2021 [Paper] [Code]
  • Wilds: A benchmark of in-the-wild distribution shifts, ICML 2021 [Paper] [Code]
  • Do image classifiers generalize across time?, ICCV 2021 [Paper]
  • Using videos to evaluate image model robustness, arXiv 2019 [Paper]
  • Regularized learning for domain adaptation under label shifts, ICLR 2019 [Paper] [Code]
  • Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code]
  • Towards deep learning models resistant to adversarial attacks, ICLR 2018 [Paper] [Code]
  • Robust physical-world attacks on deep learning visual classification, CVPR 2018 [Paper]
  • Detecting and correcting for label shift with black box predictors, ICML 2018 [Paper]
  • Poison frogs! targeted clean-label poisoning attacks on neural networks, NeurIPS 2018 [Paper] [Code]
  • Practical black-box attacks against machine learning, CCS 2017 [Paper]
  • Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, AISec 2017 [Paper] [Code]
  • Deepfool: a simple and accurate method to fool deep neural networks, CVPR 2016 [Paper] [Code]
  • Evasion attacks against machine learning at test time, ECML PKDD 2013 [Paper] [Code]
  • Adapting visual category models to new domains, ECCV 2010 [Paper]
  • Covariate shift by kernel mean matching, MIT Press 2009 [Paper]
  • Covariate shift adaptation by importance weighted cross validation, JMLR 2007 [Paper]

Prompt Engineering

  • SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization, arXiv 2023 [Paper]
  • Making Pre-trained Language Models Better Few-shot Learners, arXiv 2021 [Paper] [Code]
  • Bartscore: Evaluating generated text as text generation, NeurIPS 2021 [Paper] [Code]
  • BERTese: Learning to Speak to BERT, arXiv 2021 [Paper]
  • Few-shot text generation with pattern-exploiting training, arXiv 2020 [Paper]
  • Exploiting cloze questions for few shot text classification and natural language inference, arXiv 2020 [Paper] [Code]
  • It's not just size that matters: Small language models are also few-shot learners, arXiv 2020 [Paper]
  • How can we know what language models know?, TACL 2020 [Paper] [Code]
  • Universal adversarial triggers for attacking and analyzing NLP, EMNLP 2019 [Paper] [Code]

Data Maintenance

Data Understanding

  • The science of visual data communication: What works, Psychological Science in the Public Interest 2021 [Paper]
  • Towards out-of-distribution generalization: A survey, arXiv 2021 [Paper]
  • Snowy: Recommending utterances for conversational visual analysis, UIST 2021 [Paper]
  • A distributional framework for data valuation, ICML 2020 [Paper]
  • A comparison of radial and linear charts for visualizing daily patterns, TVCG 2020 [Paper]
  • A marketplace for data: An algorithmic solution, EC 2019 [Paper]
  • Data shapley: Equitable valuation of data for machine learning, PMLR 2019 [Paper] [Code]
  • Deepeye: Towards automatic data visualization, ICDE 2018 [Paper] [Code]
  • Voyager: Exploratory analysis via faceted browsing of visualization recommendations, TVCG 2016 [Paper]
  • A survey of clustering algorithms for big data: Taxonomy and empirical analysis, TETC 2014 [Paper]
  • On the benefits and drawbacks of radial diagrams, Handbook of Human Centric Visualization 2013 [Paper]
  • What makes a visualization memorable?, TVCG 2013 [Paper]
  • Toward a taxonomy of visuals in science communication, Technical Communication 2011 [Paper]

Data Quality Assurance

  • Human-AI Collaboration for Improving the Identification of Cars for Autonomous Driving, CIKM Workshop 2022 [Paper]
  • A Human-ML Collaboration Framework for Improving Video Content Reviews, arXiv 2022 [Paper]
  • Knowledge graph quality management: a comprehensive survey, TKDE 2022 [Paper]
  • A crowdsourcing open platform for literature curation in UniProt, PLoS Biol. 2021 [Paper] [Code]
  • Building data curation processes with crowd intelligence, Advanced Information Systems Engineering 2020 [Paper]
  • Data Curation with Deep Learning, EDBT 2020 [Paper]
  • Automating large-scale data quality verification, VLDB 2018 [Paper]
  • Data quality: The role of empiricism, SIGMOD 2017 [Paper]
  • Tfx: A tensorflow-based production-scale machine learning platform, KDD 2017 [Paper] [Code]
  • Discovering denial constraints, VLDB 2013 [Paper] [Code]
  • Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
  • Conditional functional dependencies for data cleaning, ICDE 2007 [Paper]
  • Data quality assessment, Communications of the ACM 2002 [Paper]

Data Storage and Retrieval

  • Dbmind: A self-driving platform in opengauss, PVLDB 2021 [Paper]
  • Online index selection using deep reinforcement learning for a cluster database, ICDEW 2020 [Paper]
  • Bridging the semantic gap with SQL query logs in natural language interfaces to databases, ICDE 2019 [Paper]
  • An end-to-end learning-based cost estimator, VLDB 2019 [Paper] [Code]
  • An adaptive approach for index tuning with learning classifier systems on hybrid storage environments, Hybrid Artificial Intelligent Systems 2018 [Paper]
  • Automatic database management system tuning through large-scale machine learning, SIGMOD 2017 [Paper]
  • Learning to rewrite queries, CIKM 2016 [Paper]
  • DBridge: A program rewrite tool for set-oriented query execution, IEEE ICDE 2011 [Paper]
  • Starfish: A Self-tuning System for Big Data Analytics, CIDR 2011 [Paper] [Code]
  • DB2 advisor: An optimizer smart enough to recommend its own indexes, ICDE 2000 [Paper]
  • An efficient, cost-driven index selection tool for Microsoft SQL server, VLDB 1997 [Paper]

Data Benchmark

Training Data Development Benchmark

  • OpenGSL: A Comprehensive Benchmark for Graph Structure Learning, arXiv 2023 [Paper] [Code]
  • REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines, EDBT 2023 [Paper] [Code]
  • Usb: A unified semi-supervised learning benchmark for classification, NeurIPS 2022 [Paper] [Code]
  • A feature extraction & selection benchmark for structural health monitoring, Structural Health Monitoring 2022 [Paper]
  • Data augmentation for deep graph learning: A survey, KDD 2022 [Paper]
  • Blood-based transcriptomic signature panel identification for cancer diagnosis: benchmarking of feature extraction methods, Briefings in Bioinformatics 2022 [Paper] [Code]
  • Amlb: an automl benchmark, arXiv 2022 [Paper]
  • A benchmark for data imputation methods, Front. Big Data 2021 [Paper] [Code]
  • Benchmark and survey of automated machine learning frameworks, JAIR 2021 [Paper]
  • Benchmarking differentially private synthetic data generation algorithms, arXiv 2021 [Paper]
  • A comprehensive benchmark framework for active learning methods in entity matching, SIGMOD 2020 [Paper]
  • Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy, CVPR 2020 [Paper] [Code]
  • Comparison of instance selection and construction methods with various classifiers, Applied Sciences 2020 [Paper]
  • An empirical survey of data augmentation for time series classification with neural networks, arXiv 2020 [Paper] [Code]
  • Toward a quantitative survey of dimension reduction techniques, IEEE Transactions on Visualization and Computer Graphics 2019 [Paper] [Code]
  • Cleanml: A benchmark for joint data cleaning and machine learning experiments and analysis, arXiv 2019 [Paper] [Code]
  • Comparison of different image data augmentation approaches, Journal of Big Data 2019 [Paper] [Code]
  • A benchmark and comparison of active learning for logistic regression, Pattern Recognition 2018 [Paper]
  • A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge, Database (Oxford). 2017 [Paper] [Data]
  • RODI: A benchmark for automatic mapping generation in relational-to-ontology data integration, ESWC 2015 [Paper] [Code]
  • TPC-DI: the first industry benchmark for data integration, PVLDB 2014 [Paper]
  • Comparison of instance selection algorithms II. Results and comments, ICAISC 2004 [Paper]

Inference Data Development Benchmark

  • Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv 2023 [Paper] [Code]
  • Carla: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms, arXiv 2021 [Paper] [Code]
  • Benchmarking adversarial robustness on image classification, CVPR 2020 [Paper]
  • Searching for a search method: Benchmarking search algorithms for generating nlp adversarial examples, ACL Workshop 2020 [Code]
  • Benchmarking neural network robustness to common corruptions and perturbations, ICLR 2019 [Paper] [Code] [Code]

Data Maintenance Benchmark

  • Chart-to-text: A large-scale benchmark for chart summarization, ACL 2022 [Paper] [Code]
  • Scalability vs. utility: Do we have to sacrifice one for the other in data importance quantification?, CVPR 2021 [Paper] [Code]
  • An evaluation-focused framework for visualization recommendation algorithms, IEEE Transactions on Visualization and Computer Graphics 2021 [Paper] [Code]
  • Facilitating database tuning with hyper-parameter optimization: a comprehensive experimental evaluation, VLDB 2021 [Paper] [Code]
  • Benchmarking Data Curation Systems, IEEE Data Eng. Bull. 2016 [Paper]
  • Methodologies for data quality assessment and improvement, ACM Computing Surveys 2009 [Paper]
  • Benchmark development for the evaluation of visualization for data mining, Information visualization in data mining and knowledge discovery 2001 [Paper]

Unified Benchmark

  • Dataperf: Benchmarks for data-centric AI development, arXiv 2022 [Paper]