Awesome ML Data Quality Papers

This is a list of papers about training data quality management for ML models.

Introduction

Data scientists spend about 80% of their time on data preparation for an ML pipeline, because data quality issues are not known beforehand and therefore lead to iterative debugging [1]. A good Data Quality Management System for ML (DQMS for ML) frees data scientists from the arduous process of data selection and debugging, particularly in the current era of big data and large models. Effectively automating the management of training data quality is crucial for improving the efficiency and quality of ML pipelines.

With the emergence and development of "Data-Centric AI", there has been increasing research focus on optimizing the quality of training data rather than solely concentrating on model structures and training strategies. This is the motivation behind maintaining this repository.

Before we proceed, let's define data quality for ML. In contrast to traditional data cleaning, training data quality for ML refers to the impact of individual data samples, or groups of samples, on the behavior of ML models for a given task. Note that the model behavior we are concerned with goes beyond performance metrics such as accuracy, recall, AUC, and MSE. We also consider more general metrics such as model fairness and robustness.

In an ML pipeline, the DQMS acts as middleware between the data, the ML model, and the user, and must interact with each of them.

A DQMS for ML typically consists of three components: Data Sculptor [2], Data Attributer, and Data Profiler. Achieving a well-performing ML model often requires multiple rounds of training, and the DQMS needs to iteratively adjust the training data based on the results of each round. The workflow of the DQMS in one round of training is as follows: (a) Data Sculptor first acquires the training dataset from a data source and trains the ML model with it. (b) After one round of training (several epochs), Data Attributer takes in feedback from the model together with the user's task requirements and computes the data quality assessment. (c) Data Profiler then provides a user-friendly summary of the training data. (d) Meanwhile, Data Sculptor uses the data quality assessment as feedback to acquire higher-quality training data, thus initiating a new iteration.
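
The round described above can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not an implementation from any listed paper: the "model" is simply the mean of the training values, the Data Attributer uses leave-one-out scores, and every function name (`train`, `utility`, `data_attributer`, `data_profiler`, `data_sculptor`, `run_dqms`) is hypothetical.

```python
from statistics import mean


def train(samples):
    """Toy 'model' (assumption): predict the mean of the training values."""
    return mean(samples)


def utility(model, validation):
    """Negative mean squared error on the validation set; higher is better."""
    return -mean((model - v) ** 2 for v in validation)


def data_attributer(samples, validation, base_utility):
    """Leave-one-out score: utility with the sample minus utility without it."""
    scores = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        scores.append(base_utility - utility(train(rest), validation))
    return scores


def data_profiler(samples, scores):
    """Summarise the round: dataset size and the samples that hurt utility."""
    harmful = [s for s, q in zip(samples, scores) if q < 0]
    return {"n_samples": len(samples), "harmful": harmful}


def data_sculptor(samples, scores):
    """Keep only the samples whose score is non-negative."""
    return [s for s, q in zip(samples, scores) if q >= 0]


def run_dqms(samples, validation, rounds=3):
    """Iterate steps (a)-(d); expects at least two samples."""
    history = []
    for _ in range(rounds):
        model = train(samples)                               # (a) train
        base = utility(model, validation)
        scores = data_attributer(samples, validation, base)  # (b) assess
        history.append(data_profiler(samples, scores))       # (c) summarise
        kept = data_sculptor(samples, scores)                # (d) re-select
        if len(kept) == len(samples) or len(kept) < 2:
            break
        samples = kept
    return samples, history


final, history = run_dqms([1.0, 1.2, 0.8, 9.0, 1.0], validation=[1.0])
print(history[0])
```

In this toy run the outlier 9.0 is the only sample flagged as harmful in the first round, so the second round trains without it. Real systems replace each toy function with a far more elaborate component.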

We collect the recent influential papers about DQMS for ML and annotate the relevant DQMS components involved in these papers, where DS = Data Sculptor, DA = Data Attributer, and DP = Data Profiler. The following papers are listed in chronological order of publication.

Paper List

2024

Venue Paper Links Tags TLDR
KDD'24 EcoVal: An Efficient Data Valuation Framework for Machine Learning DA
KDD'24 Scalable Rule Lists Learning with Sampling DP
KDD'24 CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect DP
arXiv'24 What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions paper DA
arXiv'24 CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning paper DA
arXiv'24 2D-OOB: Attributing Data Contribution through Joint Valuation Framework paper code DA
ICML'24 Scaling Laws for the Value of Individual Data Points in Machine Learning paper code DA This work proposes an individual scaling law that describes how the marginal contribution of a data point varies as the dataset size grows. It then proposes two methods to estimate the individual scaling law.
ICML'24 Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits paper DA DS In general cases without considering the structural assumptions of utility functions, Data Shapley’s performance in data selection tasks can be no better than that of random guessing. It proposes a heuristic for predicting Data Shapley’s optimality for data selection.
ICML'24 Incorporating Information into Shapley Values: Reweighting via a Maximum Entropy Approach paper DA
ICML'24 Distributionally Robust Data Valuation paper code DA
ICML'24 Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions paper code DA It proves that the Shapley value is more robust than LOO and proposes FreeShap, which estimates Shapley values using the eNTK without retraining.
ICML'24 Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection paper code DA
ICML'24 Optimal Coresets for Low-Dimensional Geometric Median paper DS
ICML'24 No Dimensional Sampling Coresets for Classification DS
ICML'24 Coresets for Multiple $\ell_p$ Regression DS
ICML'24 Deletion-Anticipative Data Acquisition DS
ICML'24 Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond DS
ICML'24 Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints paper code DS
ICML'24 Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary paper DS This work proposes to select a coreset that maintains the decision boundary of the model trained on the full dataset. It measures the distance from a sample to its nearest decision boundary and selects data based on this distance.
ICML'24 DsDm: Dataset Selection with Datamodels DS DsDm converts the data selection problem into a loss minimization problem on the target data. It then uses a linear datamodel to approximate the loss mapping and selects the bottom-k samples with the smallest estimated loss.
ICML'24 BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges DS
ICML'24 LESS: Selecting Influential Data for Targeted Instruction Tuning paper code DS
ICML'24 Exploiting Negative Samples: A Catalyst for Cohort Discovery in Healthcare Analytics paper DA DP This work uses the data Shapley value to value each negative sample, and employs manifold learning and clustering to find influential patterns among the negative samples.
CVPR'24 The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes paper DS
arXiv'24 Training Data Attribution via Approximate Unrolled Differentiation DA
VLDB'24 P-Shapley: Shapley Values on Probabilistic Classifiers paper code DA This paper introduces P-Shapley, which uses the raw probability (instead of accuracy) as the utility function, and proposes a calibration function to enlarge the utility change when the predicted probability is high.
VLDB'24 MetaStore: Analyzing Deep Learning Meta-Data at Scale paper DA DP
VLDB'24 Optimizing Data Acquisition to Enhance Machine Learning Performance paper code DS
VLDB'24 MisDetect: Iterative Mislabel Detection using Early Loss paper code DA
SIGMOD'24 Data Acquisition for Improving Model Confidence paper DS
SIGMOD'24 Controllable Tabular Data Synthesis Using Diffusion Models DS
SIGMOD'24 Fast Shapley Value Computation in Data Assemblage Tasks as Cooperative Simple Games paper code DA It assigns Shapley scores to data owners and their corresponding datasets in a data market.
WWW'24 Exploring Neural Scaling Law and Data Pruning Methods For Node Classification on Large-scale Graphs paper code DS This work selects training nodes that are similar to test nodes by minimizing their bottleneck distance. To avoid bias caused by trivial selection, it uses a greedy algorithm to ensure the representativeness of the selected nodes.
AAAI'24 Quality-Diversity Generative Sampling for Learning with Synthetic Data paper code DS
AAAI'24 Approximating the Shapley Value without Marginal Contributions paper DA It decomposes the Shapley value as $\phi_i = \phi_i^+ - \phi_i^-$. It samples coalitions and updates $\phi_i^+$ and $\phi_i^-$ separately.
WSDM'24 FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes paper DA
WSDM'24 Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function paper code DA
ICLR'24 "What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selection paper code DS DA It extends influence functions to account for utility, fairness, and robustness. It trains a decision tree to further estimate and interpret the influence score.
ICLR'24 Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines paper code DA It explores data valuation on raw data before preprocessing. It uses data provenance in ML pipelines and proposes data Shapley under a KNN approximation.
ICLR'24 Time Travel in LLMs: Tracing Data Contamination in Large Language Models paper code DA Data contamination means the presence of test data from downstream tasks in the pre-training data of LLMs. This work explores both instance-level and partition-level methods to identify potential contamination.
ICLR'24 GIO: Gradient Information Optimization for Training Dataset Selection paper code DA GIO selects a small subset of data from large source data by minimizing the KL divergence between the target distribution and subset.
ICLR'24 Intriguing Properties of Data Attribution on Diffusion Models paper code DA This paper proposes D-TRAK to attribute images generated by diffusion models back to the training data.
ICLR'24 D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning paper code DS A data pruning method that takes diversity into consideration. It is implemented by forward and reverse message passing in the KNN graph.
ICLR'24 Effective pruning of web-scale datasets based on complexity of concept clusters paper code DS
ICLR'24 Towards a statistical theory of data selection under weak supervision paper DS
ICLR'24 Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs paper code DS
ICLR'24 DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models paper code DA DataInf approximates the influence function by swapping the order of the matrix inversion and the average calculation.
ICLR'24 What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning paper DS
ICLR'24 Real-Fake: Effective Training Data Synthesis Through Distribution Matching paper code DS
ICLR'24 InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning paper code DS InfoBatch uses the training loss to prune well-learned samples in each epoch and estimates the gradient distribution for unbiased learning.
arXiv'24 A Decade's Battle on Dataset Bias: Are We There Yet? paper code DS
arXiv'24 Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities paper code DS It uses generative AI for augmentation, ensuring that the generated data cover the original data distribution with the smallest possible size.
arXiv'24 On the Cause of Unfairness: A Training Sample Perspective paper DA The fairness influence can be computed by replacing the training sample with its concept counterfactual sample.
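
Several entries above revolve around Shapley-style valuation. As a small worked example, the sketch below checks on a toy game that the Shapley value can be written as a difference of two expectations, $\phi_i = \phi_i^+ - \phi_i^-$, which is the kind of split that allows the two terms to be estimated separately from sampled coalitions. The three-player game, its weights, and all function names are assumptions made for illustration. The code enumerates coalitions exactly instead of sampling them, so it is not the algorithm of any listed paper.

```python
from itertools import combinations, permutations
from math import factorial

# Hypothetical toy game: the value of a coalition is its squared total weight.
WEIGHTS = {"a": 1.0, "b": 2.0, "c": 3.0}


def value(coalition):
    return sum(WEIGHTS[p] for p in coalition) ** 2


def shapley_by_permutations(players, value):
    """Classic definition: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / factorial(len(players)) for p, t in totals.items()}


def shapley_by_decomposition(players, value):
    """phi_i = phi_i^+ - phi_i^-, each a weighted average over coalitions S
    that exclude i: phi_i^+ averages value(S + {i}), phi_i^- averages value(S)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        plus = minus = 0.0
        for k in range(n):
            # Shapley weight of one coalition of size k; the weights sum to 1.
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                plus += w * value(s | {i})
                minus += w * value(s)
        phi[i] = plus - minus
    return phi


players = list(WEIGHTS)
exact = shapley_by_permutations(players, value)
split = shapley_by_decomposition(players, value)
print(exact, split)
```

Both functions agree on this game, and the values sum to the value of the full coalition, as the efficiency property of the Shapley value requires.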

2023

Venue Paper Links Tags TLDR
arXiv'23 Accelerated Shapley Value Approximation for Data Evaluation paper DA Not all coalition sizes are evaluated, since small coalitions may introduce noise and large ones may contribute little. To estimate the effect of coalitions of size k, about O(1 / k^2) sampled coalitions are sufficient.
arXiv'23 The Journey, Not the Destination: How Data Guides Diffusion Models paper code DA -
NIPS'23 The Memory Perturbation Equation: Understanding Model’s Sensitivity to Data paper code DA DP -
NIPS'23 Theoretical and Practical Perspectives on what Influence Functions Do paper DA This work discusses some problematic assumptions of IF. While most of them can be addressed, IF can predict the perturbed parameters accurately only for a limited number of time steps.
NIPS'23 Data Selection for Language Models via Importance Resampling paper code DS DA It selects data satisfying a target distribution from raw data by reducing KL divergence to the target over random selection.
NIPS'23 Model Shapley: Equitable Model Valuation with Black-box Access paper code DA -
NIPS'23 Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation paper DA Extends KNN-Shapley while considering data privacy.
NIPS'23 GEX: A flexible method for approximating influence via Geometric Ensemble paper code DA -
NIPS'23 Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks paper code DS -
NIPS'23 Data Pruning via Moving-one-Sample-out paper DS This work proposes a MoSo score (similar to LOO) and approximates it using gradients over all training epochs.
NIPS'23 Towards Free Data Selection with General-Purpose Models paper code DS -
NIPS'23 Towards Accelerated Model Training via Bayesian Data Selection paper DS -
NIPS'23 Robust Data Valuation with Weighted Banzhaf Values paper DA -
NIPS'23 UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models paper DS -
NIPS'23 Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources paper code DS Given publicly known pilot data from different data sources, it returns the optimal combination of data sources.
NIPS'23 Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy paper code DS -
NIPS'23 Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases paper DS -
NIPS'23 Core-sets for Fair and Diverse Data Summarization paper code DS DP It selects a fixed size of coreset for different groups of data while preserving diversity.
NIPS'23 Retaining Beneficial Information from Detrimental Data for Neural Network Repair paper DS -
NIPS'23 Expanding Small-Scale Datasets with Guided Imagination paper code DS -
NIPS'23 Error Discovery By Clustering Influence Embeddings paper code DA This work clusters influence embeddings (low-dimensional versions of the influence vectors of training samples) for all test samples to summarize the prediction errors.
NIPS'23 HiBug: On Human-Interpretable Model Debug paper code DP DS -
NIPS'23 Skill-it! A data-driven skills framework for understanding and training language models paper code DP DS -
ICML'23 Discover and Cure: Concept-aware Mitigation of Spurious Correlation paper code DS DA Discover spurious correlation from concept level and perform concept-based data augmentation to mitigate bias.
ICML'23 TRAK: Attributing Model Behavior at Scale paper code DA TRAK first defines a Newton approximation to estimate LOO for logistic regression and then extends it to NNs (including CLIP, mT5) by viewing them as linear models acting on input gradients.
ICML'23 RGE: A Repulsive Graph Rectification for Node Classification via Influence paper code DA RGE identifies a group of negative edges that are most harmful for GNNs. It iteratively selects negative edges by their individual influence and prefers distant edges first.
ICML'23 Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value paper code DA Data-OOB measures the average score when a datum (OOB data) is not selected in the bootstrap dataset.
ICML'23 2D-Shapley: A Framework for Fragmented Data Valuation paper DA
ICML'23 Towards Sustainable Learning: Coresets for Data-efficient Deep Learning paper code DS -
ICML'23 Workshop Training on Thin Air: Improve Image Classification with Generated Data paper DS -
ICML'23 Workshop Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation paper code DA DS -
VLDB'23 Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets paper code DA -
VLDB'23 Computing Rule-Based Explanations by Leveraging Counterfactuals paper code DP -
VLDB'23 Data Collection and Quality Challenges for Deep Learning paper DS DA -
SIGMOD'23 GoodCore: Coreset Selection over Incomplete Data for Data-effective and Data-efficient Machine Learning paper DS GoodCore selects a coreset that achieves expected low gradient approximation error among all possible worlds of missing data.
SIGMOD'23 XInsight: eXplainable Data Analysis Through The Lens of Causality paper DP -
SIGMOD'23 HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation paper code DS DP -
arXiv'23 Simfluence: Modeling the influence of individual training examples by simulating training runs paper DA DS Trains a simulator that generates a time series that predicts what the loss on $z_{test}$ would be after each step of the training run (a loss trajectory).
ICLR'23 Data Valuation Without Training of a Model paper code DA It proposes a score that measures the gap in data complexity when a certain data instance is removed from the full dataset.
ICLR'23 Distilling Model Failures as Directions in Latent Space paper code DS DP -
ICLR'23 LAVA: Data Valuation without Pre-Specified Learning Algorithms paper code DA LAVA uses a Wasserstein distance to estimate the upper bound of test performance. It values a training sample by its sensitivity to the distance.
ICLR'23 Concept-level Debugging of Part-Prototype Networks paper code DP -
ICLR'23 Dataset Pruning: Reducing Training Data by Examining Generalization Influence paper DS -
ICLR'23 Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning paper code DS -
ICLR'23 Learning to Estimate Shapley Values with Vision Transformers paper code DA -
ICLR'23 Characterizing the Influence of Graph Elements paper code DA Introduce influence function into graphs, considering node- and edge-removal influence and the linear SGC model.
ICDE'23 Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise paper code DP -
ICDE'23 Detection of Groups with Biased Representation in Ranking paper DA -
AAAI'23 Fundamentals of Task-Agnostic Data Valuation paper DA -
AAAI'23 Interpreting Unfairness in Graph Neural Networks via Training Node Attribution paper code DA This work proposes a Probabilistic Distribution Disparity to define node-contributed model bias and uses gradient approximation to estimate node-level bias.
WWW'23 GIF: A General Graph Unlearning Strategy via Influence Function paper code DA GIF extends influence function to graph data by considering both the directly affected node(s) and the influenced neighborhoods.
AISTATS'23 Data Banzhaf: A Robust Data Valuation Framework for Machine Learning paper DA -
arXiv'23 Data-Juicer: A One-Stop Data Processing System for Large Language Models paper code DS DP -
arXiv'23 Studying Large Language Model Generalization with Influence Functions paper DA -
TMLR'23 Synthetic Data from Diffusion Models Improves ImageNet Classification paper DS -
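
To make the out-of-bag idea from the Data-OOB entry above concrete, here is a hedged sketch. It assumes one-dimensional inputs and a 1-nearest-neighbour weak learner, both chosen only to keep the example short. The dataset, the function names `nearest_label` and `data_oob`, and the number of bootstraps are illustrative assumptions rather than the paper's setup.

```python
import random


def nearest_label(x, bag):
    """1-nearest-neighbour prediction from a list of (x, y) pairs."""
    return min(bag, key=lambda pair: abs(pair[0] - x))[1]


def data_oob(data, n_bootstraps=200, seed=0):
    """Value of datum i = average correctness of the weak learners whose
    bootstrap sample did not contain i (i was out-of-bag)."""
    rng = random.Random(seed)
    n = len(data)
    hits = [0] * n
    counts = [0] * n
    for _ in range(n_bootstraps):
        chosen = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(chosen)
        bag = [data[j] for j in chosen]
        for i in range(n):
            if i in in_bag:
                continue  # only out-of-bag points are scored
            x, y = data[i]
            counts[i] += 1
            hits[i] += int(nearest_label(x, bag) == y)
    # None marks a datum that was never out-of-bag (very unlikely here).
    return [h / c if c else None for h, c in zip(hits, counts)]


# Two well-separated clusters plus one point whose label contradicts
# its neighbourhood (the last entry).
data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (5.0, 1), (5.1, 1), (5.2, 1), (5.3, 1), (5.4, 1),
        (-1.0, 1)]
values = data_oob(data)
print([round(v, 2) for v in values])
```

The point with the contradictory label receives a value close to zero, while the clean points score close to one, which is the behaviour that makes such scores useful for spotting label errors.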

2022

Venue Paper Links Tags TLDR
NIPS'22 CS-SHAPLEY: Class-wise Shapley Values for Data Valuation in Classification paper code DA
NIPS'22 Beyond neural scaling laws: beating power law scaling via data pruning paper DS
NIPS'22 Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP paper code DS
NIPS'22 Quantifying memorization across neural language models paper DA
ICML'22 Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments paper DA It proposes the AME score $E_S[U(S\cup \{z\})-U(S)]$ with $S$ being a random set. The AME score can be approximated by a LASSO model.
ICML'22 Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations paper code DS DP It learns a CAV and moves the misclassified training samples along the direction of the CAV.
ICML'22 Datamodels: Predicting Predictions from Training Data paper code DA Datamodels learns a linear model to predict the model output on a single test example. It takes as input the one-hot mask of training samples.
ICML'22 Prioritized Training on Points that are learnable, Worth Learning, and Not Yet Learnt paper code DS
ICML'22 Achieving Fairness at No Utility Cost via Data Reweighing with Influence paper code DA It employs demographic parity and equal opportunity to compute IF and performs soft reweighing on training samples. A proof of no utility degradation is provided.
ICML'22 DAVINZ: Data Valuation using Deep Neural Networks at Initialization paper DA It uses an NTK-based bound to approximate validation performance without training.
ICML'22 Understanding Instance-Level Impact of Fairness Constraint paper code DA IF = IF of loss + IF of fairness constraint. It considers several constraints including demographic parity, equal opportunity, covariance, information, etc. and uses NTK to estimate IF.
ICLR'22 Domino: Discovering systematic errors with cross-modal embeddings paper code DA DP
ICLR'22 Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning paper DA
VLDB'22 Toward Interpretable and Actionable Data Analysis with Explanations and Causality paper DP
SIGMOD'22 Complaint-Driven Training Data Debugging at Interactive Speeds paper DA
SIGMOD'22 Interpretable Data-Based Explanations for Fairness Debugging paper video DA DP
ACL'22 Deduplicating training data makes language models better paper code DS
AAAI'22 Scaling Up Influence Functions paper code DA
AISTATS'22 Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning paper code DA

2021 and before

Venue Paper Links Tags TLDR
NIPS'21 Explaining Latent Representations with a Corpus of Examples paper code DA
NIPS'21 Validation free and replication robust volume-based data valuation paper code DA
NIPS'21 Deep Learning on a Data Diet: Finding Important Examples Early in Training paper DS
NIPS'21 Interactive Label Cleaning with Example-based Explanations paper code DP
ICML'21 GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training paper code DS
CVPR'21 Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification? paper code DA
CHI'21 Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency paper DP
NIPS'20 Multi-Stage Influence Function paper DA
NIPS'20 Estimating Training Data Influence by Tracing Gradient Descent paper code DA TracIn measures the influence of training batched samples during training by estimating the test loss change w.r.t. earlier epochs.
ICML'20 On second-order group influence functions for black-box predictions paper DA The influence score of a group = the sum of individual influence per sample + cross-dependencies among samples in the group.
ICML'20 Coresets for data-efficient training of machine learning models paper code DS
ICML'20 Optimizing Data Usage via Differentiable Rewards paper DS
ICML'20 Data Valuation using Reinforcement Learning paper code DA DVRL employs a learnable NN as a data value estimator to select data samples during training and uses an RL signal to update it.
ICLR'20 Selection via proxy: Efficient data selection for deep learning paper code DS
SIGMOD'20 Complaint Driven Training Data Debugging for Query 2.0 paper video DA
PMLR'20 Identifying Statistical Bias in Dataset Replication paper code
NIPS'19 Data Cleansing for Models Trained with SGD paper code DA The proposed SGD-Influence scales influence estimation to SGD-based NNs without the convexity and optimality assumptions.
ICML'19 Data Shapley: Equitable Valuation of Data for Machine Learning paper code DA
VLDB'19 Efficient task-specific data valuation for nearest neighbor algorithms paper DA
AISTATS'19 Towards Efficient Data Valuation Based on the Shapley Value paper DA
ICML'17 Understanding Black-box Predictions via Influence Functions paper code DA
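
The gradient-tracing idea behind the TracIn entry above can be illustrated with a checkpoint-based sketch: the influence of a training point on a test point is approximated by summing, over saved checkpoints, the learning rate times the product of their loss gradients. The one-parameter linear model, full-batch gradient descent, the toy data, and all function names below are assumptions made to keep the example self-contained. They are not the setup of the paper.

```python
def grad(w, x, y):
    """Gradient of the loss 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def train_with_checkpoints(train_set, lr=0.05, steps=20):
    """Full-batch gradient descent from w = 0, saving w before every step."""
    w, checkpoints = 0.0, []
    for _ in range(steps):
        checkpoints.append(w)
        w -= lr * sum(grad(w, x, y) for x, y in train_set)
    return checkpoints


def tracin_scores(train_set, test_point, checkpoints, lr=0.05):
    """Sum over checkpoints of lr * grad(train point) * grad(test point).
    A positive score means steps on that training point tend to reduce the
    test loss; a negative score means they tend to increase it."""
    x_test, y_test = test_point
    scores = []
    for x, y in train_set:
        scores.append(sum(lr * grad(w, x, y) * grad(w, x_test, y_test)
                          for w in checkpoints))
    return scores


# Two points consistent with y = 2x and one point that contradicts it.
train_set = [(1.0, 2.0), (2.0, 4.0), (1.0, -2.0)]
test_point = (1.5, 3.0)
checkpoints = train_with_checkpoints(train_set)
scores = tracin_scores(train_set, test_point, checkpoints)
print([round(s, 3) for s in scores])
```

The two consistent training points get positive scores and the contradictory one gets a negative score, matching the intuition that it pulls the model away from the test target.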

Surveys

Venue Paper Links Tags
arXiv'24 A Survey on Data Selection for Language Models paper DS
Nature Machine Intelligence'22 Advances, challenges and opportunities in creating data for trustworthy AI paper DS DA
arXiv'23 Data-centric Artificial Intelligence: A Survey paper DS DA DP
arXiv'23 Data Management For Large Language Models: A Survey paper code DS DA
arXiv'23 Training Data Influence Analysis and Estimation: A Survey paper code DA
TKDE'22 Data Management for Machine Learning: A Survey paper DS DA
IJCAI'21 Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges paper DA
TACL'21 Explanation-Based Human Debugging of NLP Models: A Survey paper DP DA

Benchmarks

Venue Paper Links Tags
NIPS'23 DataPerf: Benchmarks for Data-Centric AI Development paper code website DS DA DP
NIPS'23 OpenDataVal: a Unified Benchmark for Data Valuation paper code DA
NIPS'23 Improving multimodal datasets with image captioning paper code DS
NIPS'23 Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias paper code DS
DEEM'22 dcbench: A Benchmark for Data-Centric AI Systems paper code DS

Related Workshops and Tutorials

  1. [ICML'23] DMLR Workshop: Data-centric Machine Learning Research video DMLR Website
  2. [NIPS'23] Tutorial: Data Contribution Estimation for Machine Learning Website

Related Repos

  1. More papers about Data Valuation can be found in awesome-data-valuation. DA
  2. More papers about Data Pruning can be found in Awesome-Coreset-Selection. DS

Reference

[1] Gupta, Nitin, et al. "Data quality for machine learning tasks." Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021.

[2] Liang, Weixin, et al. "Advances, challenges and opportunities in creating data for trustworthy AI." Nature Machine Intelligence 4.8 (2022): 669-677.