Data Management for Training LLM

A curated list of resources on training data management for large language models. The papers are organized according to our survey paper, Data Management For Training Large Language Models: A Survey.

Pretraining
Supervised Fine-Tuning
Useful Resources

Pretraining

Domain Composition

LaMDA: Language models for dialog applications (Arxiv, Jan. 2022) [Paper] [Code]
Data Selection for Language Models via Importance Resampling (Arxiv, Feb. 2023) [Paper] [Code]
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (ICLR 2023) [Paper] [Model]
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Arxiv, May 2023) [Paper] [Code]
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
SlimPajama-DC: Understanding Data Combinations for LLM Training (Arxiv, Sep. 2023) [Paper] [Model] [Dataset]
DoGE: Domain Reweighting with Generalization Estimation (Arxiv, Oct. 2023) [Paper] [Code]
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance (Arxiv, Mar. 2024) [Paper] [Code]
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (ICLR 2024) [Paper] [Code]

Data Quantity

Scaling Laws
- Scaling Laws for Neural Language Models (Arxiv, Jan. 2020) [Paper]
- An empirical analysis of compute-optimal large language model training (NeurIPS 2022) [Paper]
- Unraveling the Mystery of Scaling Laws: Part I (Arxiv, Mar. 2024) [Paper]
Data Repetition
- Scaling Laws and Interpretability of Learning from Repeated Data (Arxiv, May 2022) [Paper]
- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning (Arxiv, Oct. 2022) [Paper]
- Scaling Data-Constrained Language Models (Arxiv, May 2023) [Paper] [Code]
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis (Arxiv, May 2023) [Paper]
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification (Arxiv, Aug. 2023) [Paper]

Data Quality

Quality Filtering
- An Empirical Exploration in Quality Filtering of Text Data (Arxiv, Sep. 2021) [Paper]
- Quality at a glance: An audit of web-crawled multilingual datasets (ACL 2022) [Paper]
- The MiniPile Challenge for Data-Efficient Language Models (Arxiv, Apr. 2023) [Paper] [Dataset]
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
- Textbooks Are All You Need (Arxiv, Jun. 2023) [Paper] [Code]
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (NeurIPS 2023) [Paper] [Dataset]
- Textbooks Are All You Need II: phi-1.5 technical report (Arxiv, Sep. 2023) [Paper] [Model]
- When less is more: Investigating Data Pruning for Pretraining LLMs at Scale (Arxiv, Sep. 2023) [Paper]
- Ziya2: Data-centric Learning is All LLMs Need (Arxiv, Nov. 2023) [Paper] [Model]
- Phi-2: The surprising power of small language models (Blog post, Dec. 2023) [Post]
- QuRating: Selecting High-Quality Data for Training Language Models (ICML 2024) [Paper] [Code]
Deduplication
- Deduplicating training data makes language models better (ACL 2022) [Paper] [Code]
- Deduplicating training data mitigates privacy risks in language models (ICML 2022) [Paper]
- Noise-Robust De-Duplication at Scale (ICLR 2023) [Paper]
- SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Arxiv, Mar. 2023) [Paper] [Code]
Toxicity Filtering
- Detoxifying language models risks marginalizing minority voices (NAACL-HLT, 2021) [Paper] [Code]
- Challenges in detoxifying language models (EMNLP Findings, 2021) [Paper]
- What’s in the box? A preliminary analysis of undesirable content in the Common Crawl corpus (Arxiv, May 2021) [Paper] [Code]
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
Diversity & Age
- Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Arxiv, Jun. 2023) [Paper]
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (Arxiv, Oct. 2023) [Paper] [Code]
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
*Social Biases
- Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus (EMNLP 2021) [Paper]
- An empirical survey of the effectiveness of debiasing techniques for pre-trained language models (ACL, 2022) [Paper] [Code]
- Whose language counts as high quality? Measuring language ideologies in text data selection (EMNLP, 2022) [Paper] [Code]
- From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (ACL 2023) [Paper] [Code]
*Hallucinations
- How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis (ACL 2022) [Paper]
- On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? (NAACL 2022) [Paper]
- Sources of Hallucination by Large Language Models on Inference Tasks (EMNLP Findings, 2023) (https://arxiv.org/abs/2305.14552)

Relations Among Different Aspects

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
SlimPajama-DC: Understanding Data Combinations for LLM Training (Arxiv, Sep. 2023) [Paper] [Model] [Dataset]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (Arxiv, Jan. 2024) [Paper] [Model]
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic (CVPR 2024) [Paper] [Code]
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining (Arxiv, May 2024) [Paper]

Supervised Fine-Tuning

Task Composition

Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks (EMNLP 2022) [Paper] [Dataset]
Finetuned Language Models Are Zero-Shot Learners (ICLR 2022) [Paper] [Dataset]
Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR 2022) [Paper] [Code]
Scaling Instruction-Finetuned Language Models (Arxiv, Oct. 2022) [Paper] [Dataset]
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization (Arxiv, Dec. 2022) [Paper] [Model]
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (ICML, 2023) [Paper] [Dataset]
Exploring the Benefits of Training Expert Language Models over Instruction Tuning (ICML, 2023) [Paper] [Code]
Data-Efficient Finetuning Using Cross-Task Nearest Neighbors (ACL Findings, 2023) [Paper] [Code]
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (Arxiv, Jun. 2023) [Paper] [Code]
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
LESS: Selecting Influential Data for Targeted Instruction Tuning (Arxiv, Feb. 2024) [Paper] [Code]
Instruction Matters, a Simple yet Effective Task Selection Approach in Instruction Tuning for Specific Tasks (Arxiv, Apr. 2024) [Paper]

Data Quality

Instruction Quality
- Self-Refine: Iterative refinement with self-feedback (Arxiv, Mar. 2023) [Paper] [Project]
- LIMA: Less is more for alignment (Arxiv, May 2023) [Paper] [Dataset]
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
- SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation (Blog post, May 2023) [Project]
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Arxiv, Jun. 2023) [Paper] [Code]
- Instruction mining: High-quality instruction data selection for large language models (Arxiv, Jul. 2023) [Paper] [Code]
- AlpaGasus: Training A Better Alpaca with Fewer Data (Arxiv, Jul. 2023) [Paper]
- Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models (Arxiv, Aug. 2023) [Paper]
- Self-Alignment with Instruction Backtranslation (Arxiv, Aug. 2023) [Paper]
- SELF: Language-Driven Self-Evolution for Large Language Models (Arxiv, Oct. 2023) [Paper]
- LoBaSS: Gauging Learnability in Supervised Fine-tuning Data (Arxiv, Oct. 2023) [Paper]
- Tuna: Instruction Tuning using Feedback from Large Language Models (EMNLP 2023) [Paper] [Code]
- Automatic Instruction Optimization for Open-source LLM Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
- MoDS: Model-oriented Data Selection for Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
- One Shot Learning as Instruction Data Prospector for Large Language Models (Arxiv, Dec. 2023) [Paper]
- An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models (Arxiv, Jan. 2024) [Paper]
- Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning (Arxiv, Feb. 2024) [Paper] [Code]
- SelectIT: Selective Instruction Tuning for Large Language Models via Uncertainty-Aware Self-Reflection (Arxiv, Feb. 2024) [Paper] [Code]
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning (NAACL 2024) [Paper] [Code]
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning (ACL Findings 2024) [Paper] [Code]
- Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models (Arxiv, Feb. 2024) [Paper]
- SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models (Arxiv, Mar. 2024) [Paper]
- Automated Data Curation for Robust Language Model Fine-Tuning (Arxiv, Mar. 2024) [Paper]
- SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning (Arxiv, May 2024) [Paper]
Instruction Diversity
- Self-Instruct: Aligning language models with self-generated instructions (ACL 2023) [Paper] [Code]
- Stanford Alpaca (Mar. 2023) [Code]
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper] [Code]
- LIMA: Less is more for alignment (Arxiv, May 2023) [Paper] [Dataset]
- #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
- Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration (Arxiv, Oct. 2023) [Paper] [Code]
- DiffTune: A Diffusion-Based Approach to Diverse Instruction-Tuning Data Generation (NeurIPS 2023) [Paper]
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning (Arxiv, Nov. 2023) [Paper] [Code]
- Data Diversity Matters for Robust Instruction Tuning (Arxiv, Nov. 2023) [Paper]
- Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation (Arxiv, Feb. 2024) [Paper] [Code]
- Multi-view fusion for instruction mining of large language model (Information Fusion Oct. 2024) [Paper]
Instruction Complexity
- WizardLM: Empowering Large Language Models to Follow Complex Instructions (Arxiv, Apr. 2023) [Paper] [Code]
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct (Arxiv, Jun. 2023) [Paper] [Code]
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4 (Arxiv, Jun. 2023) [Paper] [Code]
- A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment (Arxiv, Aug. 2023) [Paper]
- #InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
- Can Large Language Models Understand Real-World Complex Instructions? (Arxiv, Sep. 2023) [Paper] [Benchmark]
- FollowBench: A multi-level fine-grained constraints following benchmark for large language models (Arxiv, Oct. 2023) [Paper] [Code]
- Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models (Arxiv, Feb. 2024) [Paper] [Code]
- From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models (Arxiv, Apr. 2024) [Paper] [Code]
*Prompt Design
- Reframing instructional prompts to GPTk’s language (ACL Findings, 2022) [Paper] [Code]
- Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts (NAACL, 2022) [Paper] [Code]
- Demystifying Prompts in Language Models via Perplexity Estimation (Arxiv, Dec. 2022) [Paper]
- Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning (ACL, 2023) [Paper] [Code]
- Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (ACL, 2023) [Paper]
- The False Promise of Imitating Proprietary LLMs (Arxiv, May 2023) [Paper]
- Exploring Format Consistency for Instruction Tuning (Arxiv, Jul. 2023) [Paper]
- Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning (Arxiv, Oct. 2023) [Paper]
- Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]
*Hallucinations
- LIMA: Less is more for alignment (Arxiv, May 2023) [Paper] [Dataset]
- AlpaGasus: Training A Better Alpaca with Fewer Data (Arxiv, Jul. 2023) [Paper]
- Instruction mining: High-quality instruction data selection for large language models (Arxiv, Jul. 2023) [Paper] [Code]
- Platypus: Quick, Cheap, and Powerful Refinement of LLMs (NeurIPS 2023 Workshop) [Paper] [Code]

Data Quantity

Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases (Arxiv, Mar. 2023) [Paper]
LIMA: Less is more for alignment (Arxiv, May 2023) [Paper] [Dataset]
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method (ICLR 2024) [Paper]

Dynamic Data-Efficient Learning

Training Affects Data
- NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks (SustaiNLP, 2023) [Paper]
- Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning (Arxiv, Jul. 2023) [Paper]
- Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks (EMNLP 2023) [Paper] [Code]
Data Affects Training
- Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation (Arxiv, May 2023) [Paper] [Code]
- OpenChat: Advancing Open-source Language Models with Mixed-Quality Data (Arxiv, Sep. 2023) [Paper] [Code]
- How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition (Arxiv, Oct. 2023) [Paper]
- Contrastive post-training large language models on data curriculum (Arxiv, Oct. 2023) [Paper]
- InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions (NAACL 2024) [Paper]
- Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models (Arxiv, Feb. 2024) [Paper] [Code]
- Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning (Arxiv, May 2024) [Paper]

Relations Among Different Aspects

#InsTag: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models (Arxiv, Aug. 2023) [Paper] [Code]
Data Diversity Matters for Robust Instruction Tuning (Arxiv, Nov. 2023) [Paper]
Rethinking the Instruction Quality: LIFT is What You Need (Arxiv, Dec. 2023) [Paper]

Useful Resources

Practical guides for LLM [Repo]
Introduction to LLM [Repo]
Survey of LLM [Repo]
Data-centric AI [Repo]
Scaling laws for LLM [Repo]
Instruction datasets [Repo]
Instruction tuning [Repo1] [Repo2]
