Awesome Synthetic (text) datasets

Generating datasets using large language models

What is synthetic data?

Synthetic data refers to artificially generated data that usually aims to mimic real-world data. This data is created algorithmically, often using models or simulations, rather than collected from real-world sources. Synthetic data has been used for a long time in machine learning. Since the advent of LLMs, they have increasingly been used to produce synthetic data, and synthetic data has increasingly been used to train them.

Resources

This repository aims to organize resources focused on helping people (including myself) get started with building synthetic datasets. As a result, it is not exhaustive and focuses mostly on pragmatic, practical resources.

Tutorials, guides and educational blog posts

Synthetic data: save money, time and carbon with open source: shows how to use an open-source LLM to create synthetic data to train your customized model in a few steps.
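As a rough illustration of that kind of workflow, the sketch below has an LLM label raw text and keeps only the answers that map cleanly onto a known label, producing rows a smaller model could be trained on. The `llm` callable, the prompt wording and `fake_llm` are all assumptions made for this example and are not taken from the blog post.

```python
from typing import Callable, Dict, List


def build_label_prompt(text: str, labels: List[str]) -> str:
    """Ask for exactly one label from a fixed set."""
    return (
        f"Classify the text as one of: {', '.join(labels)}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )


def annotate(
    texts: List[str], labels: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Label each text with the LLM; labels are expected to be lowercase."""
    rows = []
    for text in texts:
        answer = llm(build_label_prompt(text, labels)).strip().lower()
        # Drop answers that are not one of the allowed labels.
        if answer in labels:
            rows.append({"text": text, "label": answer})
    return rows


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call, so the sketch runs offline.
    return "positive" if "rose" in prompt else "negative"


rows = annotate(
    ["Profits rose 10% this quarter.", "Shares fell sharply after the report."],
    ["positive", "negative"],
    fake_llm,
)
```

Swapping `fake_llm` for a function that calls a hosted or local model is the only change needed to try this on real data.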

Examples in this repository

Generating Embedding Data with LLMs
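One common shape for this kind of data is a (query, positive, hard negative) triplet. The sketch below asks a model for such a triplet as JSON and discards malformed replies. The key names, the prompt text and `fake_embedding_llm` are assumptions for illustration only, not the format used by the example in this repository.

```python
import json
from typing import Callable, Dict, List, Optional

REQUIRED_KEYS = ("query", "positive", "hard_negative")


def build_triplet_prompt(task: str) -> str:
    return (
        f"Task: {task}\n"
        "Return a JSON object with the keys 'query', 'positive' and "
        "'hard_negative'. The hard negative should look relevant to the "
        "query but must not answer it."
    )


def parse_triplet(raw: str) -> Optional[Dict[str, str]]:
    """Return a cleaned triplet, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return {key: data[key].strip() for key in REQUIRED_KEYS}


def generate_triplets(
    tasks: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    triplets = []
    for task in tasks:
        parsed = parse_triplet(llm(build_triplet_prompt(task)))
        if parsed is not None:
            triplets.append(parsed)
    return triplets


def fake_embedding_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return json.dumps(
        {
            "query": "how do I reset my router",
            "positive": "Hold the reset button for ten seconds to restore defaults.",
            "hard_negative": "Routers forward packets between networks.",
        }
    )


triplets = generate_triplets(["Retrieve help-desk answers"], fake_embedding_llm)
```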

Important techniques

Important Datasets

TinyStories

TinyStories is a synthetic dataset of short stories that only contain words that typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. The authors show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.
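The TinyStories paper describes encouraging diversity by seeding each prompt with a few randomly chosen simple words and a story feature. A minimal sketch of that idea follows. The word lists, feature list and prompt wording here are tiny illustrative placeholders, not the ones the authors used.

```python
import random
from typing import Dict, List

# Illustrative placeholder vocabulary; the real word list is much larger.
VOCAB: Dict[str, List[str]] = {
    "verb": ["jump", "share", "hide"],
    "noun": ["ball", "dog", "tree"],
    "adjective": ["happy", "tiny", "red"],
}
FEATURES: List[str] = ["a dialogue", "a plot twist", "a moral value"]


def make_story_prompt(rng: random.Random) -> str:
    """Build one story prompt from a random verb, noun, adjective and feature."""
    words = [rng.choice(VOCAB[pos]) for pos in ("verb", "noun", "adjective")]
    feature = rng.choice(FEATURES)
    return (
        "Write a short story using only words a 3 to 4-year-old would understand. "
        f"The story must use the words {', '.join(words)} "
        f"and should contain {feature}."
    )


# A seeded generator makes the prompt set reproducible.
rng = random.Random(0)
prompts = [make_story_prompt(rng) for _ in range(3)]
```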

OpenHermes-2.5

The Open Hermes 2.5 dataset is a continuation of the Open Hermes 1 dataset: a much larger, more diverse, and higher-quality compilation that reaches 1M primarily synthetically generated instruction and chat samples.

Cosmopedia

Cosmopedia is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.

A reproduction of Textbooks Are All You Need. More detail in this blog post

WebSight

WebSight is a large synthetic dataset containing HTML/CSS codes representing synthetically generated English websites, each accompanied by a corresponding screenshot.

synthetic_text_to_sql

gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details.
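A cheap sanity check for any synthetic Text-to-SQL sample is to run the query against its schema. The sketch below does this with an in-memory SQLite database. The field names `sql_context` and `sql` are assumptions about the record layout, and samples written for other SQL dialects may fail here even when they are valid.

```python
import sqlite3
from typing import Dict


def sql_executes(sample: Dict[str, str]) -> bool:
    """Return True if the sample's query runs against its own schema."""
    conn = sqlite3.connect(":memory:")
    try:
        # Create the tables (and any seed rows) the query depends on.
        conn.executescript(sample["sql_context"])
        conn.execute(sample["sql"]).fetchall()
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


# Hypothetical samples for illustration.
good_sample = {
    "sql_context": (
        "CREATE TABLE sales (id INTEGER, amount REAL);"
        "INSERT INTO sales VALUES (1, 9.5), (2, 20.0);"
    ),
    "sql": "SELECT SUM(amount) FROM sales;",
}
bad_sample = {
    "sql_context": "CREATE TABLE sales (id INTEGER, amount REAL);",
    "sql": "SELECT total FROM sales;",  # column does not exist
}
```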

Libraries, code and tools

This list isn't comprehensive and tries to focus on either actively developed libraries or libraries/code examples that demonstrate a particular approach well.

distilabel

⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.

This is a very flexible library that is actively being developed and improved. It supports a large number of LLM providers. It includes many common synthetic data generation techniques from papers such as Self-Instruct and EvolInstruct.
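To give a feel for the Evol-Instruct idea, here is a library-free sketch: each round asks a model to rewrite instructions into harder variants and drops rewrites that are empty or unchanged. This is plain Python and does not use the distilabel API. The templates, `evolve` and `fake_evol_llm` are hypothetical names made up for this example.

```python
import random
from typing import Callable, List

# Illustrative rewrite requests; each ends with a blank line and the instruction.
EVOLVE_TEMPLATES = [
    "Rewrite the instruction below so that it adds one extra constraint.\n\n{instruction}",
    "Rewrite the instruction below so that it needs more reasoning steps.\n\n{instruction}",
    "Rewrite the instruction below using more specific concepts.\n\n{instruction}",
]


def evolve(
    seeds: List[str], llm: Callable[[str], str], rounds: int, rng: random.Random
) -> List[str]:
    """Return the seeds plus every successful rewrite from each round."""
    pool = list(seeds)
    current = list(seeds)
    for _ in range(rounds):
        next_generation = []
        for instruction in current:
            template = rng.choice(EVOLVE_TEMPLATES)
            candidate = llm(template.format(instruction=instruction)).strip()
            # Treat empty or unchanged rewrites as failed evolutions.
            if candidate and candidate != instruction:
                next_generation.append(candidate)
        pool.extend(next_generation)
        current = next_generation
    return pool


def fake_evol_llm(prompt: str) -> str:
    # Hypothetical stand-in: echoes the instruction with one added requirement.
    original = prompt.split("\n\n", 1)[1]
    return original + " Explain your reasoning."


pool = evolve(["Sort a list."], fake_evol_llm, rounds=2, rng=random.Random(0))
```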

llm-swarm

Manage scalable open LLM inference endpoints in Slurm clusters

This library is primarily focused on scaling synthetic text generation using Slurm clusters. It was used to generate Cosmopedia.

Domain Specific Dataset Project

This is a project to bootstrap the creation of domain-specific datasets for training models. The goal is to create a set of tools that help users to collaborate with domain experts. Includes a UI tool for creating a pipeline for generating a synthetic dataset focused on a particular domain.

AutoPrompt

A framework for prompt tuning using Intent-based Prompt Calibration

This framework aims to help you automatically generate high-quality, detailed prompts. It uses a refinement process, where it iteratively builds a dataset of challenging edge cases and optimizes the prompt accordingly. This approach aims to reduce the manual effort in prompt engineering and reduce prompt sensitivity.
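The refinement loop can be pictured roughly as below: propose hard cases, test the current prompt on them, and revise the prompt when it fails. This is a simplified sketch and not AutoPrompt's actual interface. `calibrate_prompt` and the three callables are hypothetical, and the toy stand-ins exist only so the loop can run.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def calibrate_prompt(
    prompt: str,
    propose_cases: Callable[[str, List[Example]], List[Example]],
    predict: Callable[[str, str], str],
    revise: Callable[[str, List[Example]], str],
    rounds: int,
) -> Tuple[str, List[Example]]:
    """Grow a dataset of edge cases and revise the prompt until it passes them."""
    dataset: List[Example] = []
    for _ in range(rounds):
        dataset.extend(propose_cases(prompt, dataset))
        failures = [(x, y) for x, y in dataset if predict(prompt, x) != y]
        if not failures:
            break
        prompt = revise(prompt, failures)
    return prompt, dataset


# Toy stand-ins for the model-backed steps.
def toy_propose(prompt: str, dataset: List[Example]) -> List[Example]:
    return [(f"Oh great, delay number {len(dataset) + 1}.", "negative")]


def toy_predict(prompt: str, text: str) -> str:
    return "negative" if "sarcasm" in prompt else "positive"


def toy_revise(prompt: str, failures: List[Example]) -> str:
    return prompt + " Watch for sarcasm."


final_prompt, cases = calibrate_prompt(
    "Classify the review as positive or negative.",
    toy_propose,
    toy_predict,
    toy_revise,
    rounds=3,
)
```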

Self-Contrast

Self-Contrast is an innovative method that offers an annotation-free approach for aligning with human preference.

This is the code accompanying the paper "Extensive Self-Contrast Enables Feedback-Free Language Model Alignment". Rather than relying on a reward model or preference annotations, the method starts from supervised fine-tuning targets, has the LLM generate many candidate responses itself, and uses a pre-trained embedding model to select negatives from them by text similarity.

Important Papers

Some important papers about synthetic data generation. Note I'm not trying to add every possible paper on the topic, but rather focus on those that introduced important techniques or had a big impact (particularly "in the wild"). I am collecting a longer list of papers in this Hugging Face Collection.

Textbooks Are All You Need
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Self-Instruct: Aligning Language Model with Self Generated Instructions
WizardLM: Empowering Large Language Models to Follow Complex Instructions (also check out the other Wizard papers, e.g. WizardCoder: Empowering Code Large Language Models with Evol-Instruct).
Improving Text Embeddings with Large Language Models
Extensive Self-Contrast Enables Feedback-Free Language Model Alignment
