This is a collection of research papers accompanying the paper A Survey on Data Selection for Language Models.

License: Creative Commons Zero v1.0 Universal (CC0-1.0)

A Survey on Data Selection for Language Models

This repo is a convenient listing of papers on data selection for language models, covering all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!

For more detail on these works and others, see our survey paper, A Survey on Data Selection for Language Models, by this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, and William Yang Wang.

A conceptual demonstration of the data pipeline for language model training

Table of Contents

Data Selection for Pretraining

Conceptualization of objectives and constraints on data selection for pretraining

Language Filtering

Heuristic Approaches

Data Quality

Domain-Specific Selection

Data Deduplication

Filtering Toxic and Explicit Content

Specialized Selection for Multilingual Models

Data Mixing

Data Selection for Instruction-Tuning and Multitask Training

Conceptualization of objectives and constraints on data selection for instruction-tuning

Data Selection for Preference Fine-tuning: Alignment

Conceptualization of objectives and constraints on data selection for alignment

Data Selection for In-Context Learning

Conceptualization of objectives and constraints on data selection for in-context learning

Data Selection for Task-specific Fine-tuning

Conceptualization of objectives and constraints on data selection for task-specific fine-tuning

Contribution

There are likely some amazing works in the field that we missed, so please contribute to the repo.

Feel free to open a pull request with new papers or create an issue and we can add them for you. Thank you in advance for your efforts!