Awesome ML Data Quality Papers

This is a list of papers about training data quality management for ML models.

Introduction

Data scientists spend about 80% of their time on data preparation for an ML pipeline, because data quality issues are unknown beforehand and lead to iterative debugging [1]. A good Data Quality Management System for ML (DQMS for ML) frees data scientists from the arduous process of data selection and debugging, particularly in the current era of big data and large models. Automating training data quality management effectively is crucial for improving the efficiency and quality of ML pipelines.

With the emergence and development of "Data-Centric AI", there has been increasing research focus on optimizing the quality of training data rather than solely concentrating on model structures and training strategies. This is the motivation behind maintaining this repository.

Before we proceed, let's define data quality for ML. In contrast to traditional data cleaning, training data quality for ML refers to the impact of individual data samples, or groups of samples, on the behavior of ML models for a given task. Note that the model behavior we are concerned with goes beyond performance metrics such as accuracy, recall, AUC, and MSE. We also consider broader properties such as model fairness and robustness.
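As a minimal illustration of "impact of a data sample on model behavior", the sketch below scores each training sample by leave-one-out (LOO) retraining. It is an assumption-laden toy, not a method from any listed paper: the model is plain least squares, the behavior of interest is validation MSE, and the helper names (`fit`, `val_loss`, `leave_one_out_values`) are hypothetical.

```python
import numpy as np

def fit(X, y):
    # Least squares without intercept; stands in for "training the model".
    return np.linalg.lstsq(X, y, rcond=None)[0]

def val_loss(w, X_val, y_val):
    # Model behavior of interest here: mean squared error on validation data.
    return float(np.mean((X_val @ w - y_val) ** 2))

def leave_one_out_values(X, y, X_val, y_val):
    # value_i = validation loss without sample i minus validation loss with all samples.
    # Positive: removing the sample hurts (helpful sample).
    # Negative: removing the sample helps (harmful sample).
    base = val_loss(fit(X, y), X_val, y_val)
    values = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        values[i] = val_loss(fit(X[keep], y[keep]), X_val, y_val) - base
    return values

# Toy data following y = 2x, except the last label, which is corrupted (0 instead of 8).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 0.0])
X_val = np.array([[1.0], [2.0]])
y_val = np.array([2.0, 4.0])

values = leave_one_out_values(X, y, X_val, y_val)
print(values)  # only the corrupted sample receives a negative value
```

LOO needs one retraining per sample, which is why many of the papers below study cheaper approximations such as influence functions or Shapley-value estimators.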

In the pipeline described below, a DQMS acts as middleware between the data, the ML model, and the user, and needs to interact with each of them.

A DQMS for ML typically consists of three components: Data Sculptor [2], Data Attributer, and Data Profiler. Achieving a well-performing ML model often requires multiple rounds of training, and the DQMS iteratively adjusts the training data based on the results of each round. The workflow of the DQMS in one round of training is as follows:

(a) Data Sculptor first acquires the training dataset from a data source and trains the ML model with it.

(b) After one round of training (several epochs), Data Attributer takes feedback from the model together with the user's task requirements and computes the data quality assessment.

(c) Data Profiler then provides a user-friendly summary of the training data.

(d) Meanwhile, Data Sculptor uses the data quality assessment as feedback to acquire higher-quality training data, thus initiating a new iteration.
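The round-based workflow can be sketched as a small loop. Everything here is a hypothetical stand-in chosen for illustration: least squares plays the ML model, a validation set plays the user's task requirement, leave-one-out scoring plays the Data Attributer, and the Data Sculptor simply drops the single most harmful sample per round. The function names and the `tol` threshold are assumptions, not an API from any listed paper.

```python
import numpy as np

def train(X, y):
    # Stand-in for model training: least squares without intercept.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def val_loss(w, X_val, y_val):
    return float(np.mean((X_val @ w - y_val) ** 2))

def data_attributer(X, y, w, X_val, y_val):
    # Score each sample by the leave-one-out change in validation loss.
    # Negative score: the model does better without the sample.
    base = val_loss(w, X_val, y_val)
    scores = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        scores[i] = val_loss(train(X[keep], y[keep]), X_val, y_val) - base
    return scores

def data_profiler(ids, scores, tol):
    # User-facing summary of the current training set.
    return {
        "n_samples": len(ids),
        "flagged_ids": ids[scores < -tol].tolist(),
        "min_score": float(scores.min()),
    }

def data_sculptor(ids, scores, tol):
    # Drop the single lowest-scoring sample if it is clearly harmful.
    worst = int(np.argmin(scores))
    return np.delete(ids, worst) if scores[worst] < -tol else ids

def run_dqms(X_pool, y_pool, X_val, y_val, rounds=3, tol=1e-9):
    ids = np.arange(len(X_pool))
    reports = []
    for _ in range(rounds):
        X, y = X_pool[ids], y_pool[ids]
        w = train(X, y)                                    # (a) train on current data
        scores = data_attributer(X, y, w, X_val, y_val)    # (b) assess data quality
        reports.append(data_profiler(ids, scores, tol))    # (c) summarize for the user
        ids = data_sculptor(ids, scores, tol)              # (d) refine data for next round
    return ids, reports

# Toy pool following y = 2x, with sample 3 mislabeled (0 instead of 8).
X_pool = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_pool = np.array([2.0, 4.0, 6.0, 0.0, 10.0])
X_val = np.array([[1.0], [2.0]])
y_val = np.array([2.0, 4.0])

kept, reports = run_dqms(X_pool, y_pool, X_val, y_val)
print(kept.tolist(), reports[0]["flagged_ids"])
```

In this toy run the mislabeled sample is flagged and removed in the first round, after which no further sample is dropped. A real system would replace each stand-in with a scalable method from the lists below.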

We collect the recent influential papers about DQMS for ML and annotate the relevant DQMS components involved in these papers, where DS = Data Sculptor, DA = Data Attributer, and DP = Data Profiler. The following papers are listed in roughly chronological order of publication.

Paper List

2024

Venue	Paper	Links	Tags	TLDR
arXiv'24	Towards Data Valuation via Asymmetric Data Shapley	paper code	`DA`
arXiv'24	Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation	paper	`DA`
arXiv'24	Distilling The Knowledge in Data Pruning	paper	`DS`
OpenReview	Harnessing Diversity for Important Data Selection in Pretraining Large Language Models	paper	`DA`
OpenReview	SAVA: Scalable Learning-Agnostic Data Valuation	paper	`DA`
OpenReview	Data Attribution for Multitask Learning	paper	`DA`
OpenReview	On the Inflation of KNN-Shapley Value	paper	`DA`
OpenReview	Data Valuation for Graphs	paper	`DA`
OpenReview	Precedence-Constrained Winter Value for Effective Graph Data Valuation	paper	`DA`
OpenReview	Data Shapley in One Training Run	paper	`DA`
OpenReview	Generalized Group Data Attribution	paper	`DA`
OpenReview	Top-m Data Values Identification	paper	`DA`
NIPS'24	Not All Tokens Are What You Need for Pretraining	paper	`DS`
NIPS'24	Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution	paper code code	`DA`
NIPS'24	Data Distribution Valuation	paper code	`DA`
NIPS'24	DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation	paper	`DA`
NIPS'24	SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning	paper	`DA`	It first divides the data into clusters and computes the Shapley value of each cluster, and then selects representative data points within each cluster.
NIPS'24	2D-OOB: Attributing Data Contribution through Joint Valuation Framework	paper code	`DA`
NIPS'24	Training Data Attribution via Approximate Unrolling	paper	`DA`
NIPS'24	Data Attribution for Text-to-Image Models by Unlearning Synthesized Images		`DA`
NIPS'24	Efficient Sketches for Training Data Attribution and Studying the Loss Landscape		`DA`
NIPS'24	MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models	paper	`DS`
ICDE'24	When Data Pricing Meets Non-cooperative Game Theory	paper	`DA`
arXiv'24	Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection	paper code	`DS` `DA`
KDD'24	EcoVal: An Efficient Data Valuation Framework for Machine Learning	paper code	`DA`	To make data valuation efficient, this work first divides the data into clusters and computes each cluster's value with LOO. It then assigns individual data values inside a cluster through a production function.
KDD'24	Approximating Memorization Using Loss Surface Geometry for Dataset Pruning and Summarization	paper code	`DS` `DP`	This paper shows that the memorization score is effective for data summarization and selection tasks, and proposes to approximate memorization with SGD.
KDD'24	Scalable Rule Lists Learning with Sampling	paper code	`DP`	This work proposes to learn an approximately optimal rule set through sampling, preserving both accuracy and efficiency.
KDD'24	AIM: Attributing, Interpreting, Mitigating Data Unfairness	paper code	`DP`
KDD'24	CAT: Interpretable Concept-based Taylor Additive Models	paper code
KDD'24	Dataset Regeneration for Sequential Recommendation	paper code	`DS`
arXiv'24	What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions	paper	`DA`
arXiv'24	CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning	paper	`DA`
ICML'24	QuRating: Selecting High-Quality Data for Training Language Models	paper code	`DS`
ICML'24	Scaling Laws for the Value of Individual Data Points in Machine Learning	paper code	`DA`	This work proposes an individual scaling law that characterizes how the marginal contribution of a data point varies as the dataset size grows. It then proposes two methods to estimate the individual scaling law.
ICML'24	Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits	paper	`DA` `DS`	In general cases without considering the structural assumptions of utility functions, Data Shapley’s performance in data selection tasks can be no better than that of random guessing. It proposes a heuristic for predicting Data Shapley’s optimality for data selection.
ICML'24	Incorporating Information into Shapley Values: Reweighting via a Maximum Entropy Approach	paper	`DA`
ICML'24	Distributionally Robust Data Valuation	paper code	`DA`
ICML'24	Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions	paper code	`DA`	It proves that Shapley value shows better robustness compared to LOO and proposes FreeShap to estimate Shapley using eNTK without retraining.
ICML'24	Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection	paper code	`DA`
ICML'24	Optimal Coresets for Low-Dimensional Geometric Median	paper	`DS`
ICML'24	No Dimensional Sampling Coresets for Classification	paper	`DS`
ICML'24	Coresets for Multiple $\ell_p$ Regression	paper	`DS`
ICML'24	Deletion-Anticipative Data Selection with a Limited Budget	paper	`DS`
ICML'24	Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond	paper	`DS`
ICML'24	Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints	paper code	`DS`
ICML'24	Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary	paper	`DS`	This work proposes to select a coreset that maintains the decision boundary of the model trained on the full dataset. It measures the distance from a sample to its nearest decision boundary and selects data based on this distance.
ICML'24	DsDm: Dataset Selection with Datamodels	paper	`DS`	DsDm converts the data selection problem into a loss minimization problem on the target data. It then uses a linear datamodel to approximate the loss mapping and selects the bottom-k samples with the smallest estimated loss.
ICML'24	BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges		`DS`
ICML'24	LESS: Selecting Influential Data for Targeted Instruction Tuning	paper code	`DS`
ICML'24	Exploiting Negative Samples: A Catalyst for Cohort Discovery in Healthcare Analytics	paper	`DA` `DP`	This work leverages the data Shapley value to value each negative sample, and employs manifold learning and clustering to find influential patterns in negative samples.
CVPR'24	The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes	paper	`DS`
VLDB'24	Counterfactual Explanation of Shapley Value in Data Coalitions	paper code	`DA`	If the Shapley value of data owner A is higher than that of B, the counterfactual explanation aims to find the smallest subset of data in A such that moving it from A to B makes the Shapley value of A less than that of B. This work proposes a greedy search to find the counterfactual.
VLDB'24	P-Shapley: Shapley Values on Probabilistic Classifiers	paper code	`DA`	This paper introduces P-Shapley, which uses raw probability (instead of accuracy) as the utility function, and proposes a calibration function to enlarge the utility change when the predicted probability is high.
VLDB'24	MetaStore: Analyzing Deep Learning Meta-Data at Scale	paper	`DA` `DP`
VLDB'24	Optimizing Data Acquisition to Enhance Machine Learning Performance	paper code	`DS`
VLDB'24	MisDetect: Iterative Mislabel Detection using Early Loss	paper code	`DA`
VLDB'24	Outlier Summarization via Human Interpretable Rules	paper code	`DP`	It trains a decision tree model to summarize the rule patterns of outliers.
VLDB'24	Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities	paper code	`DS`	It uses generative AI for augmentation, ensuring that the generated data covers the original data distribution with the smallest possible size.
VLDB'24	DataPrice: An Interactive System for Pricing Datasets in Data Marketplaces	paper	`DA`
SIGMOD'24	Rock: Cleaning Data by Embedding ML in Logic Rules	paper	`DP` `DA`	Rock implements a uniform data cleaning framework that unifies ML and logic deduction.
SIGMOD'24	Data Acquisition for Improving Model Confidence	paper	`DS`
SIGMOD'24	Controllable Tabular Data Synthesis Using Diffusion Models	paper	`DS`
SIGMOD'24	Fast Shapley Value Computation in Data Assemblage Tasks as Cooperative Simple Games	paper code	`DA`	It assigns a Shapley score for data owners and their corresponding datasets in data market.
WWW'24	Exploring Neural Scaling Law and Data Pruning Methods For Node Classification on Large-scale Graphs	paper code	`DS`	This work selects training nodes that are similar to test nodes by minimizing their bottleneck distance. To avoid bias caused by trivial selection, it uses a greedy algorithm to ensure the representativeness of the selected nodes.
AAAI'24	DeRDaVa: Deletion-Robust Data Valuation for Machine Learning	paper	`DA`
AAAI'24	Quality-Diversity Generative Sampling for Learning with Synthetic Data	paper code	`DS`
AAAI'24	Approximating the Shapley Value without Marginal Contributions	paper	`DA`	It decomposes the Shapley value as $\phi_i = \phi_i^+ - \phi_i^-$, then samples coalitions and updates $\phi_i^+$ and $\phi_i^-$ separately.
WSDM'24	FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes	paper	`DA`
WSDM'24	Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function	paper code	`DA`
ICLR'24	"What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selection	paper code	`DS` `DA`	It extends influence function considering utility, fairness and robustness. It trains a decision tree to further estimate and interpret the influence score.
ICLR'24	Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines	paper code	`DA`	It explores data valuation on raw data before preprocessing. It uses data provenance in ML pipelines and proposes data Shapley under a KNN approximation.
ICLR'24	Time Travel in LLMs: Tracing Data Contamination in Large Language Models	paper code	`DA`	Data contamination means the presence of test data from downstream tasks in the pre-training data of LLMs. This work explores both instance-level and partition-level methods to identify potential contamination.
ICLR'24	GIO: Gradient Information Optimization for Training Dataset Selection	paper code	`DA`	GIO selects a small subset of data from large source data by minimizing the KL divergence between the target distribution and subset.
ICLR'24	Intriguing Properties of Data Attribution on Diffusion Models	paper code	`DA`	This paper proposes D-TRAK to attribute images generated by diffusion models back to the training data.
ICLR'24	D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning	paper code	`DS`	A data pruning method that takes diversity into consideration. It is implemented by forward and reverse message passing in the KNN graph.
ICLR'24	Effective pruning of web-scale datasets based on complexity of concept clusters	paper code	`DS`
ICLR'24	Towards a statistical theory of data selection under weak supervision	paper	`DS`
ICLR'24	Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs	paper code	`DS`
ICLR'24	DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models	paper code	`DA`	DataInf approximates the influence function by swapping the order of the matrix inversion and the average calculation.
ICLR'24	What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning	paper	`DS`
ICLR'24	Real-Fake: Effective Training Data Synthesis Through Distribution Matching	paper code	`DS`
ICLR'24	InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning	paper code	`DS`	InfoBatch uses training loss to prune well-learned samples in each epoch and estimate gradient distribution for unbiased learning.
arXiv'24	A Decade's Battle on Dataset Bias: Are We There Yet?	paper code	`DS`
arXiv'24	On the Cause of Unfairness: A Training Sample Perspective	paper	`DA`	The fairness influence can be computed by replacing the training sample with its concept counterfactual sample.

2023

Venue	Paper	Links	Tags	TLDR
arXiv'23	Accelerated Shapley Value Approximation for Data Evaluation	paper	`DA`	Not all coalition sizes are evaluated, since small coalitions may introduce noise and large ones may contribute little. To estimate the effect of coalitions of size k, about O(1 / k^2) sampled coalitions are sufficient.
arXiv'23	The Journey, Not the Destination: How Data Guides Diffusion Models	paper code	`DA`	-
NIPS'23	The Memory Perturbation Equation: Understanding Model’s Sensitivity to Data	paper code	`DA` `DP`	-
NIPS'23	Theoretical and Practical Perspectives on what Influence Functions Do	paper	`DA`	This work discusses some problematic assumptions of IF. While most of them can be addressed, IF can predict perturbed parameters accurately only for a limited number of time steps.
NIPS'23	Data Selection for Language Models via Importance Resampling	paper code	`DS` `DA`	It selects data satisfying a target distribution from raw data by reducing KL divergence to the target over random selection.
NIPS'23	Model Shapley: Equitable Model Valuation with Black-box Access	paper code	`DA`	-
NIPS'23	Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation	paper	`DA`	Extend KNN-Shapley while considering data privacy.
NIPS'23	GEX: A flexible method for approximating influence via Geometric Ensemble	paper code	`DA`	-
NIPS'23	Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks	paper code	`DS`	-
NIPS'23	Data Pruning via Moving-one-Sample-out	paper	`DS`	This work proposes a MoSo score (similar to LOO) and approximates it using gradients over all training epochs.
NIPS'23	Towards Free Data Selection with General-Purpose Models	paper code	`DS`	-
NIPS'23	Towards Accelerated Model Training via Bayesian Data Selection	paper	`DS`	-
NIPS'23	Robust Data Valuation with Weighted Banzhaf Values	paper	`DA`	-
NIPS'23	UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models	paper	`DS`	-
NIPS'23	Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources	paper code	`DS`	Given publicly known pilot data from different data sources, it returns the optimal combination of data sources.
NIPS'23	Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy	paper code	`DS`	-
NIPS'23	Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases	paper	`DS`	-
NIPS'23	Core-sets for Fair and Diverse Data Summarization	paper code	`DS` `DP`	It selects a fixed size of coreset for different groups of data while preserving diversity.
NIPS'23	Retaining Beneficial Information from Detrimental Data for Neural Network Repair	paper	`DS`	-
NIPS'23	Expanding Small-Scale Datasets with Guided Imagination	paper code	`DS`	-
NIPS'23	Error Discovery By Clustering Influence Embeddings	paper code	`DA`	This work clusters influence embeddings (low-dimensional representations of the influence vectors of training samples) for all test samples to summarize the prediction errors.
NIPS'23	HiBug: On Human-Interpretable Model Debug	paper code	`DP` `DS`	-
NIPS'23	Skill-it! A data-driven skills framework for understanding and training language models	paper code	`DP` `DS`	-
ICML'23	Discover and Cure: Concept-aware Mitigation of Spurious Correlation	paper code	`DS` `DA`	Discover spurious correlation from concept level and perform concept-based data augmentation to mitigate bias.
ICML'23	TRAK: Attributing Model Behavior at Scale	paper code	`DA`	TRAK first defines a Newton approximation to estimate LOO for logistic regression and then extends it to NNs (including CLIP and mT5) by viewing them as linear models acting on input gradients.
ICML'23	RGE: A Repulsive Graph Rectification for Node Classification via Influence	paper code	`DA`	RGE identifies a group of negative edges that are most harmful for GNNs. It iteratively selects negative edges by their individual influence and prefers distant edges first.
ICML'23	Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value	paper code	`DA`	Data-OOB measures the average score when a datum (OOB data) is not selected in the bootstrap dataset.
ICML'23	2D-Shapley: A Framework for Fragmented Data Valuation	paper	`DA`
ICML'23	Towards Sustainable Learning: Coresets for Data-efficient Deep Learning	paper code	`DS`	-
ICML'23 Workshop	Training on Thin Air: Improve Image Classification with Generated Data	paper	`DS`	-
ICML'23 Workshop	Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation	paper code	`DA` `DS`	-
VLDB'23	Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets	paper code	`DA`	-
VLDB'23	Computing Rule-Based Explanations by Leveraging Counterfactuals	paper code	`DP`	-
VLDB'23	Data Collection and Quality Challenges for Deep Learning	paper	`DS` `DA`	-
SIGMOD'23	GoodCore: Coreset Selection over Incomplete Data for Data-effective and Data-efficient Machine Learning	paper	`DS`	GoodCore selects a coreset that achieves expected low gradient approximation error among all possible worlds of missing data.
SIGMOD'23	XInsight: eXplainable Data Analysis Through The Lens of Causality	paper	`DP`	-
SIGMOD'23	HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation	paper code	`DS` `DP`	-
SIGMOD'23	Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications	paper
ACL'23	Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values	paper	`DA`
arXiv'23	Simfluence: Modeling the influence of individual training examples by simulating training runs	paper	`DS`	Trains a simulator that generates a time series that predicts what the loss on $z_{test}$ would be after each step of the training run (a loss trajectory).
ICLR'23	Data Valuation Without Training of a Model	paper code	`DA`	It proposes a score that measures the gap in data complexity when a certain data instance is removed from the full dataset.
ICLR'23	Distilling Model Failures as Directions in Latent Space	paper code	`DS` `DP`	-
ICLR'23	LAVA: Data Valuation without Pre-Specified Learning Algorithms	paper code	`DA`	LAVA uses a Wasserstein distance to estimate the upper bound of test performance. It values a training sample by its sensitivity to the distance.
ICLR'23	Concept-level Debugging of Part-Prototype Networks	paper code	`DP`	-
ICLR'23	Dataset Pruning: Reducing Training Data by Examining Generalization Influence	paper	`DS`	-
ICLR'23	Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning	paper code	`DS`	-
ICLR'23	Learning to Estimate Shapley Values with Vision Transformers	paper code	`DA`	-
ICLR'23	Characterizing the Influence of Graph Elements	paper code	`DA`	Introduce influence function into graphs, considering node- and edge-removal influence and the linear SGC model.
ICLR'23	Dataset Pruning: Reducing Training Data by Examining Generalization Influence	paper	`DA`
ICDE'23	Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise	paper code	`DP`	-
ICDE'23	Detection of Groups with Biased Representation in Ranking	paper	`DA`	-
AAAI'23	Fundamentals of Task-Agnostic Data Valuation	paper	`DA`	-
AAAI'23	Interpreting Unfairness in Graph Neural Networks via Training Node Attribution	paper code	`DA`	This work proposes a Probabilistic Distribution Disparity to define node-contributed model bias and uses gradient approximation to estimate node-level bias.
WWW'23	GIF: A General Graph Unlearning Strategy via Influence Function	paper code	`DA`	GIF extends influence function to graph data by considering both the directly affected node(s) and the influenced neighborhoods.
AISTATS'23	Data Banzhaf: A Robust Data Valuation Framework for Machine Learning	paper	`DA`	-
arXiv'23	Data-Juicer: A One-Stop Data Processing System for Large Language Models	paper code	`DS` `DP`	-
arXiv'23	Simfluence: Modeling the influence of individual training examples by simulating training runs	paper	`DA`
arXiv'23	Studying Large Language Model Generalization with Influence Functions	paper	`DA`	-
TMLR'23	Synthetic Data from Diffusion Models Improves ImageNet Classification	paper	`DS`	-

2022

Venue	Paper	Links	Tags	TLDR
NIPS'22	CS-SHAPLEY: Class-wise Shapley Values for Data Valuation in Classification	paper code	`DA`
NIPS'22	Beyond neural scaling laws: beating power law scaling via data pruning	paper	`DS`
NIPS'22	Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP	paper code	`DS`
NIPS'22	Quantifying memorization across neural language models	paper	`DA`
ICML'22	Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments	paper	`DA`	It proposes the AME score $E_S[U(S \cup \{z\}) - U(S)]$ with $S$ being a random set. The AME score can be approximated by a LASSO model.
ICML'22	Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations	paper code	`DS` `DP`	It learns CAVs and moves the misclassified training samples toward the direction of the CAVs.
ICML'22	Datamodels: Predicting Predictions from Training Data	paper code	`DA`	Datamodels learns a linear model to predict the model output on a single test example. It takes as input the one-hot mask of training samples.
ICML'22	Prioritized Training on Points that are learnable, Worth Learning, and Not Yet Learnt	paper code	`DS`
ICML'22	Achieving Fairness at No Utility Cost via Data Reweighing with Influence	paper code	`DA`	It uses demographic parity (DP) and equal opportunity (EOP) to compute IF and performs soft reweighing of training samples. A proof of no utility degradation is provided.
ICML'22	DAVINZ: Data Valuation using Deep Neural Networks at Initialization	paper	`DA`	It uses NTK-based bound to approximate validation performance without training.
ICML'22	Understanding Instance-Level Impact of Fairness Constraint	paper code	`DA`	IF = IF of loss + IF of fairness constraint. It considers several constraints including DP, EOP, covariance, information, etc. and uses NTK to estimate IF.
ICSE'22	Training data debugging for the fairness of machine learning software	paper code	`DS`
ICLR'22	Domino: Discovering systematic errors with cross-modal embeddings	paper code	`DA` `DP`
ICLR'22	Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning	paper	`DA`
VLDB'22	Toward Interpretable and Actionable Data Analysis with Explanations and Causality	paper	`DP`
SIGMOD'22	Complaint-Driven Training Data Debugging at Interactive Speeds	paper	`DA`
SIGMOD'22	Interpretable Data-Based Explanations for Fairness Debugging	paper video	`DA` `DP`
ACL'22	Deduplicating training data makes language models better	paper code	`DS`
AAAI'22	Scaling Up Influence Functions	paper code	`DA`
AAAI'22	Incentivizing collaboration in machine learning via synthetic data rewards	paper	`DA`
AISTATS'22	Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning	paper code	`DA`

2021 and before

Venue	Paper	Links	Tags	TLDR
NIPS'21	Explaining Latent Representations with a Corpus of Examples	paper code	`DA`
NIPS'21	Validation free and replication robust volume-based data valuation	paper code	`DA`
NIPS'21	Deep Learning on a Data Diet: Finding Important Examples Early in Training	paper	`DS`
NIPS'21	Interactive Label Cleaning with Example-based Explanations	paper code	`DP`
ICML'21	GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training	paper code	`DS`
CVPR'21	Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?	paper code	`DA`
CHI'21	Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency	paper	`DP`
NIPS'20	Multi-Stage Influence Function	paper	`DA`
NIPS'20	Estimating Training Data Influence by Tracing Gradient Descent	paper code	`DA`	TracIn measures the influence of training batched samples during training by estimating the test loss change w.r.t. earlier epochs.
ICML'20	On second-order group influence functions for black-box predictions	paper	`DA`	The influence score of a group = the sum of individual influence per sample + cross-dependencies among samples in the group.
ICML'20	Coresets for data-efficient training of machine learning models	paper code	`DS`
ICML'20	Optimizing Data Usage via Differentiable Rewards	paper	`DS`
ICML'20	Data Valuation using Reinforcement Learning	paper code	`DA`	DVRL employs a learnable NN as a data value estimator to select data samples during training and uses an RL signal to update it.
ICML'20	Collaborative Machine Learning with Incentive-Aware Model Rewards	paper	`DA`
ICLR'20	Selection via proxy: Efficient data selection for deep learning	paper code	`DS`
SIGMOD'20	Complaint-Driven Training Data Debugging for Query 2.0	paper video	`DA`
PMLR'20	Identifying Statistical Bias in Dataset Replication	paper code
NIPS'19	Data Cleansing for Models Trained with SGD	paper code	`DA`	The proposed SGD-Influence scales influence estimation to SGD-based NNs without the convexity and optimality assumptions.
ICML'19	Data Shapley: Equitable Valuation of Data for Machine Learning	paper code	`DA`
VLDB'19	Efficient task-specific data valuation for nearest neighbor algorithms	paper	`DA`
AISTATS'19	Towards Efficient Data Valuation Based on the Shapley Value	paper	`DA`
ICML'17	Understanding Black-box Predictions via Influence Functions	paper code	`DA`

Surveys

Venue	Paper	Links	Tags
arXiv'24	A Survey on Data Selection for Language Models	paper	`DS`
Nature Machine Intelligence'22	Advances, challenges and opportunities in creating data for trustworthy AI	paper	`DS` `DA`
arXiv'23	Data-centric Artificial Intelligence: A Survey	paper	`DS` `DA` `DP`
arXiv'23	Data Management For Large Language Models: A Survey	paper code	`DS` `DA`
arXiv'23	Training Data Influence Analysis and Estimation: A Survey	paper code	`DA`
TKDE'22	Data Management for Machine Learning: A Survey	paper	`DS` `DA`
IJCAI'21	Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges	paper	`DA`
TACL'21	Explanation-Based Human Debugging of NLP Models: A Survey	paper	`DP` `DA`

Benchmarks

Venue	Paper	Links	Tags
NIPS'23	DataPerf: Benchmarks for Data-Centric AI Development	paper code website	`DS` `DA` `DP`
NIPS'23	OpenDataVal: a Unified Benchmark for Data Valuation	paper code	`DA`
NIPS'23	Improving multimodal datasets with image captioning	paper code	`DS`
NIPS'23	Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias	paper code	`DS`
DEEM'22	dcbench: A Benchmark for Data-Centric AI Systems	paper code	`DS`

Related Repos

More papers about Data Valuation can be found in awesome-data-valuation. `DA`
More papers about Data Pruning can be found in Awesome-Coreset-Selection. `DS`

Reference

[1] Gupta, Nitin, et al. "Data quality for machine learning tasks." Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021.

[2] Liang, Weixin, et al. "Advances, challenges and opportunities in creating data for trustworthy AI." Nature Machine Intelligence 4.8 (2022): 669-677.

SJTU-DMTai/awesome-ml-data-quality-papers