We aim to provide the best references for searching, selecting, and synthesizing high-quality data at scale for post-training your LLMs.
There are three contributions in this repository:
- Data Generation: We provide the data generation process for two important domains: instruction following and function calling.
- Dataset Compilation: We collected and compiled a list of high-quality datasets for post-training LLMs in the domains of instruction following, coding, and math, and provide a quality check for each dataset.
- Dataset Curation: Based on the quality check, we carefully curated a new dataset for post-training LLMs, with each source selected and evaluated for quality and relevance.
Disclaimer: The license information below is taken from the original repositories. We have also noticed that some datasets, although claimed to be open, were actually generated with commercial models. Please double-check carefully before using them, especially if you intend to use them for commercial purposes or similar.
You can download the datasets directly from the Hugging Face Hub. There are two versions:
- Flywheel-v1: a small, highly curated dataset.
- Flywheel-v2: a large and diverse dataset (recommended).
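
A minimal loading sketch with the Hugging Face `datasets` library is shown below; the repository IDs are placeholders, so substitute the actual Flywheel dataset IDs published on the Hub.

```python
# Minimal sketch, assuming the datasets are published as standard Hub datasets.
# "your-org/Flywheel-v1" and "your-org/Flywheel-v2" are placeholder IDs --
# replace them with the actual repository names on the Hugging Face Hub.
from datasets import load_dataset

flywheel_v1 = load_dataset("your-org/Flywheel-v1", split="train")  # small, highly curated
flywheel_v2 = load_dataset("your-org/Flywheel-v2", split="train")  # large and diverse (recommended)

print(flywheel_v2)     # number of rows and column names
print(flywheel_v2[0])  # inspect a single example
```
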
We provide the data generation process for two important domains: instruction following and function calling.
For the dataset compilation, we apply the following quality checks when selecting datasets:
- Domain: we only consider the following tasks: instruction following, coding, and math. Non-English datasets are not considered.
- Data source: only keep GPT-4-generated data; drop inferior data sources (e.g. gpt-3.5-turbo).
- Popularity: only keep popular datasets (more than 1K downloads).
- Accuracy: randomly sample 20 examples for instruction-tuning datasets and 10 for other domains, check the quality manually, and report the quality signal x / 20 (the fraction of correct samples; see the sketch after this list).
- Relevance Score (1-5):
- 5: Directly corresponds to one of [IFEval*, MTBench, AGIEval*, AlpacaEval, …] (risk of overfitting to the benchmark)
- 4: Generally has an instruction-following format with GPT-4- or human-level responses
- 3: Mostly has an instruction-following format with correct responses
- 2: Has major flaws (e.g. irrelevant) but may still be useful
- 1: Low quality or potentially harmful
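
The following is a minimal sketch of how the accuracy signal can be computed from a manual review; the function names and the boolean labels are illustrative, not part of any released tooling.

```python
# Minimal sketch of the manual accuracy check described above: sample 20 examples
# for instruction-following datasets (10 for other domains), label each one by
# hand, and report the fraction that is correct.
import random

def sample_for_review(dataset, n=20, seed=0):
    """Draw n random examples for manual inspection."""
    rng = random.Random(seed)
    return [dataset[i] for i in rng.sample(range(len(dataset)), n)]

def accuracy_signal(labels):
    """labels: booleans from manual review (True = correct). Returns x / n."""
    return sum(labels) / len(labels)

# Example: 19 out of 20 sampled examples judged correct -> 0.95
print(accuracy_signal([True] * 19 + [False]))
```
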
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
glaiveai/glaive-function-calling-v2 | No duplicates in the first 10 examined. Wide variety of tasks. | 113K | 4.5 | 4.5 | | apache-2.0 |
Salesforce/xlam-function-calling-60k | Answers are function names and parameter lists. Contains functions with ambiguous parameter types and trivial functions. | 60K | 5 | 4.5 | | cc-by-4.0 |
Gorilla OpenFunctions-v2 | JSON-format data on GitHub, no Hugging Face dataset. Uses AST matching to determine whether API calls are correct. | 17K | 5 | 5 | | apache-2.0 |
NousResearch/hermes-function-calling-v1 | The function-calling split contains tool calls and responses, compared to the single-turn split. | 1893 | 4.5 | | | apache-2.0 |
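
As a reference for the format noted above for Salesforce/xlam-function-calling-60k (answers are function names and parameter lists), here is a minimal inspection sketch. The field names (query, tools, answers) follow that dataset's card at the time of writing; verify them against the actual schema before relying on this.

```python
# Minimal sketch: inspect one record of a function-calling dataset.
# Field names are assumptions based on the Salesforce/xlam-function-calling-60k card.
import json
from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

example = ds[0]
print(example["query"])                   # natural-language user request
tools = json.loads(example["tools"])      # available function signatures
answers = json.loads(example["answers"])  # expected calls: name + arguments
for call in answers:
    print(call["name"], call["arguments"])
```
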
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
ise-uiuc/Magicoder-OSS-Instruct-75K | Question 1 gives the task, inputs, constraints, and an example (LeetCode style); question 2 gives a method signature; question 3 gives just a problem description. | 75.2K | 4.5 | 3.5 | | mit |
RLHFlow/CodeUltraFeedback-standard | RLHF format, including chosen and rejected responses. There are 50,156 chosen-rejected pairs in total, with roughly 38.4K unique chosen answers. | 38.4k/50.2k (see notes) | 4 | 4 | Sizes are unique chosen answers and total chosen-rejected pairs, respectively | mit |
codeparrot/apps | Competitive-programming (Codeforces) style prompts with inputs, constraints, examples, and descriptions. Includes separate test cases. A method signature is sometimes provided. Relatively complicated; items too long to check. | 10K | N/A | N/A | | mit |
iamtarun/python_code_instructions_18k_alpaca | Prompts sometimes include supplied examples. Even with supplied examples, models only sometimes give the corresponding output. | 18.6K | 5 | 4 | | N/A |
bigcode/self-oss-instruct-sc2-exec-filter-50k | Final self-alignment training dataset for StarCoder2-Instruct. | 50.7k | | | | odc-by |
theblackcat102/evol-codealpaca-v1 | Similar to ise-uiuc/Magicoder-Evol-Instruct-110K. | 111k | | | | apache-2.0 |
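
Several of the coding datasets above (e.g. RLHFlow/CodeUltraFeedback-standard) are in preference format with chosen and rejected responses. Below is a minimal sketch of converting such pairs into plain SFT examples by keeping only the chosen responses; the chosen/rejected field names and the role/content message structure are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch, assuming "chosen" is a list of {"role": ..., "content": ...}
# messages, as in common preference-format datasets. Verify the actual schema first.
from datasets import load_dataset

ds = load_dataset("RLHFlow/CodeUltraFeedback-standard", split="train")

def to_sft(example):
    messages = example["chosen"]
    prompt = "\n".join(m["content"] for m in messages if m["role"] == "user")
    response = next(m["content"] for m in reversed(messages) if m["role"] == "assistant")
    return {"prompt": prompt, "response": response}

sft_ds = ds.map(to_sft, remove_columns=ds.column_names)
print(sft_ds[0]["prompt"][:200])
```
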
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
meta-math/MetaMathQA | Question 4 does not provide explicit formulas. Original questions are sometimes rewritten to be parameterized. | 395k | 4.75 | 4.5 | | mit |
MathInstruct | Contains 13 source datasets, such as CAMEL math. We examined the first 10 questions: question 2 has answer candidates and a correct answer; questions 3 and 5 do not provide a specific answer; most do not provide specific answers. | 262K | 4.5 | 3 | | mit |
camel-ai/math | Composed of 50K problem-solution pairs obtained using GPT-4. | 50k | 5 | 4.5 | | cc-by-nc-4.0 |
xinlai/Math-Step-DPO-10K | RLHF format, including chosen and rejected responses. Uses step-by-step prompts; initial_reason_steps includes preliminary calculations and hints. | 10.8k | 4.5 | 3.5 | | cc-by-nc-4.0 |
openai/gsm8k | Commonly used in many benchmarks, including the LLM Leaderboard. Answers include <<...>>-formatted calculations. | train 7.47k, test 1.32k | 5 | 4.5 | | mit |
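
Since openai/gsm8k answers embed <<...>> calculator annotations and put the final answer after "####" (as noted in the table above), a minimal parsing sketch may be useful; the "main" config name and the "answer" field follow the dataset card.

```python
# Minimal sketch: strip GSM8K calculator annotations and extract the final answer.
import re
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="train")

def clean_answer(answer: str):
    """Split off the final answer (after '####') and remove <<...>> annotations."""
    rationale, final = answer.split("####")
    rationale = re.sub(r"<<[^>]*>>", "", rationale).strip()
    return rationale, final.strip()

rationale, final = clean_answer(ds[0]["answer"])
print(final)  # numeric final answer for the first training example
```
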
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
Open-Orca/1million-gpt-4 | The FLAN collection augmented by submitting the listed questions to GPT-4. Many questions supply a passage as context. | 1M | 5 | 4 | | N/A |
Open-Orca/SlimOrca | This release provides an efficient means of reaching performance on par with larger slices of the OpenOrca data while only including ~500k GPT-4 completions. Many questions supply a passage as context. | 518k | 5 | 4 | | mit |
teknium/GPT4-LLM-Cleaned | Instruction-following data generated by GPT-4 using Alpaca prompts, separated into a main instruction with an optional input, e.g. "instruction": "what does this code do?", "input": "def function()". | 54.6k | 5 | 4 | | apache-2.0 |
databricks/databricks-dolly-15k | Dolly 2.0 (pairs, English, 15K+ entries): a dataset of human-written prompts and responses, featuring tasks like question answering and summarization. Questions are categorized, e.g. "closed_qa", "classification", "open_qa", and an optional "context" field is sometimes supplied. | 15k | 5 | 4 | | cc-by-sa-3.0 |
allenai/WildChat-1M (GPT4-EN) | 1 million conversations between human users and ChatGPT; 25.53% of the conversations come from the GPT-4 chatbot, the rest from GPT-3.5. Contains accompanying scores/classifications for various categories of harmfulness, e.g. "harassment", "self-harm". Many non-English entries. | 168k | 4 | 5 | Filtered to GPT-4 English entries; size refers to GPT-4 entries only | odc-by |
sablo/oasst2_curated | A filtered and curated dataset taken from the top-scoring OpenAssistant/oasst2 conversations, saved in HF chat format. The result is a high-quality dataset for SFT. | train 4.69k, test 24 | 5 | 4 | Open-ended conversation, human annotated | apache-2.0 |
CollectiveCognition/chats-data-2023-09-22 | Collection of chats between users and ChatGPT, shared by users on the "Collective Cognition" website. Includes ChatGPT-generated conversation titles. | 156 | 4.75 | 4 | Human: after filtering for GPT-4 | mit |
lmsys/lmsys-chat-1m | One million real-world conversations with 25 state-of-the-art LLMs. Includes conversation topics with model tags, language, harmfulness ratings across multiple axes, and PII redaction. Many non-English prompts. | 1M | 4.5 | 4 | Human: after filtering for GPT-4 | LMSYS-Chat-1M Dataset License |
teknium/GPTeacher-General-Instruct | GPT-4-generated self-instruct dataset. A mix of open/closed QA, rewriting, and answering questions based on a supplied passage. | 89.3k | 4.5 | 4 | GPT-4 generated | mit |
stingning/ultrachat | Some of the 774K examples are very long, exceeding 10,000 characters. Questions and responses are combined into one field. | 774k | 4.5 | 4 | Human: each dialogue is a list of strings, ChatGPT-generated with human refinements | mit |
jondurbin/airoboros-3.2 | Modified self-instruct with GPT-4. Contains some harmful/toxic content. | 58,709 | 4.5 | 4 | Accuracy: errors in mathematical calculations. Data was generated primarily with GPT-4 | cc-by-4.0 |
openbmb/UltraInteract_sft | A large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. | 289K | 4 | 5 | Specifically for reasoning | mit |
AutoIF | Synthetic dataset that matches IFEval; no open-source download available. Restrictions on output format and length, e.g. 50 words, 5 sentences, 4-syllable words, palindromes. Strong emphasis on conciseness. | N/A | N/A | N/A | Hacks IFEval to generate data | apache-2.0 |
WizardLM/WizardLM_evol_instruct_V2_196k | Original WizardLM data. | 143k | 4.5 | 3 | Human: some errors; gpt-3.5-turbo generated; evolved from a mixture of Alpaca and ShareGPT | mit |
TIGER-Lab/WebInstructSub | Instruction data mined from the web corpus, which contains vast amounts of high-quality instruction data spanning domains like math and science. Specifically contains data from mathstackexchange, stackexchange, and socratic. | 2.34M | 5 | 3 | Human: not relevant | apache-2.0 |
allenai/soda | Dialogue dataset covering a wide range of social interactions. | train 1.19M, validation 146k, test 149k | 5 | 3 | Accuracy: discrepancy in the amount of dialogue and conversation data; dialogues contain proper_name information. Human: not GPT-4 level | cc-by-4.0 |
nvidia/Daring-Anteater | Consists of 100k conversations averaging 2.88 model turns each, mostly generated using an NVIDIA proprietary model and Mixtral-8x7B-Instruct-v0.1, with the remaining samples sourced from FinQA, wikitablequestions, and commercially friendly subsets of Open-Platypus. | 99.5k | 5 | 3 | Human: from NVIDIA proprietary models and Mixtral-8x7B-Instruct-v0.1, not GPT-4 | cc-by-4.0 |
yahma/alpaca-cleaned | Cleaned version of Alpaca, GPT_LLM, and GPTeacher (pairs, English), used by several Alpaca/LLaMA-like models. Cleaned to correct hallucinations, merged instructions, empty outputs, empty code examples, instructions to generate images, N/A outputs, wrong answers (?), nonsensical/unclear instructions, and extra escape and control characters. | 52k | 5 | 3 | Should review some of the choices for cleaning data | cc-by-4.0 |
tatsu-lab/alpaca | A dataset generated by text-davinci-003 to enhance language models' ability to follow human instructions (pairs, English, 52K entries, 21.4MB; used by ChatGLM-fine-tune-LoRA and Koala). Contains an instruction field (all unique), an optional input in ~40% of the data, the model output, and a formatted combination following a prompt template. | 52k | 4.5 | 3 | | cc-by-nc-4.0 |
cascip/ChatAlpaca | Uses ChatGPT (gpt-3.5-turbo) to generate follow-up utterances and continue the conversation with ChatGPT. | 20k | 4 | 3 | | apache-2.0 |
philschmid/guanaco-sharegpt-style | Some code content, mostly general conversations. Mostly non-English. | 9.03k | 3 | 3 | Accuracy: many foreign languages. Human: after filtering, a high-quality GPT-4 daily Q&A dataset of about 6K examples, mainly knowledge Q&A, programming questions, and reasoning/calculation, covering Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and other languages | N/A |
andersonbcdefg/gpt4all | Questions from Stack Overflow. Contains HTML tags. | 438k | 3 | 2 | Human: prompts are HTML, coding, and math, not relevant to instruction following | N/A |
OpenAssistant/oasst1 | Data collected on the open-assistant.io website until April 12, 2023. A human-generated, human-annotated assistant-style conversation corpus of 161,443 messages in 35 languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | train 84.4k, val 4.4k | N/A | 4 | Human-level responses; the conversation trees need to be processed to inspect the data | apache-2.0 |
OpenAssistant/oasst2 | Data collected on the open-assistant.io website until Nov 5, 2023. Same type of data as oasst1. Contains message trees, where the initial prompt is the root node with multiple child nodes as different replies, representing different conversation routes. | train 129k, val 6.6k | N/A | 4 | Human-level responses; the conversation trees need to be processed to inspect the data | apache-2.0 |
Salesforce/dialogstudio | "Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI." A variety of dialogues, including knowledge-grounded dialogues, natural-language understanding, open-domain dialogues, task-oriented dialogues, dialogue summarization, and conversational recommendation. | 3,994,204 | 0 | 2 | Focused on conversational AI, irrelevant. Accuracy: cannot be viewed online, must be downloaded locally first; dialogues are organized as turn lists with much auxiliary information | apache-2.0 |
argilla/magpie-ultra-v0.1 | Synthetically generated dataset for supervised fine-tuning using the Llama 3.1 405B-Instruct model together with other Llama models such as Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct. Includes synthetic difficulty tags and required-knowledge info. Base instructions generated by Llama-405B, supplementary info by 8B Llama models. | 50k | 4.75 | 3.5 | Llama-3.1-405B generated | llama3.1 |
bigscience/P3 | A wide variety of NLP tasks, including multiple-choice QA, sentiment analysis, and natural language inference. | 122,127,848 | 5 | 3 | Responses are short, mostly 1-2 sentences. A lot of duplicates; substantial additional filtering is probably needed | apache-2.0 |
yizhongw/self_instruct | The Hugging Face dataset also includes P3 and Super-Natural Instructions data. Self-Instruct is a framework that improves a language model's ability to follow natural-language instructions by using the model's own generations to create a large collection of instructional data, without relying on extensive manual annotation. Mostly in prompt-completion format given a passage. | 82.6k | | 3 | Human: not GPT-4 level | apache-2.0 |
meta-llama/Meta-Llama-3.1-8B-Instruct-evals | Contains the Meta evaluation-result details for Meta-Llama-3.1-8B-Instruct, created from 30 evaluation tasks. | 157k | | 2 | Human: not GPT-4 level; Llama 3 generated on benchmarks! | llama3.1 |
mosaicml/instruct-v3 | Each example has a marked source. An aggregate dataset composed of Dolly HH-RLHF (derived from Databricks Dolly), Self-Instruct (Yizhong Wang), and HH (Anthropic Harmless), combined with Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD, and Spider. A brief prompt template is included with every instruction. | train 56.2k, test 6.81k | | 2 | Not GPT-4 level; irrelevant tasks | cc-by-sa-3.0 |
teknium/OpenHermes-2.5 | Airoboros 2.2 + CamelAI Domain Expert Datasets (Physics, Math, Chemistry & Biology) + ChatBot Arena (GPT-4 Only) + Collective Cognition (09-11-2023) + CoT Alpaca GPT4 + Evol Instruct 70K & 140K + Glaive Code Assistant + GPT4-LLM + GPTeacher + Medical Tasks + MetaMath 40k + SlimOrca 550K + Platypus + ShareGPT (GPT4-Only) + Unnatural Instructions GPT4. | 1M | | | Naive mixture of multiple datasets. Filtering included removal of OpenAI refusals, disclaimers, and "As an AI"-type examples, among others | N/A |
bilexi/Bitext-customer-support-llm-chatbot-training-dataset | The user provides questions, and the response is a prompt-style answer from the assistant. | 26.9k | | 2 | Irrelevant (customer service) | cdla-sharing-1.0 |
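
Several notes above mention keeping only GPT-4 English entries (e.g. for allenai/WildChat-1M and lmsys/lmsys-chat-1m). Here is a minimal filtering sketch; the "model" and "language" field names follow those dataset cards at the time of writing, and the datasets are gated, so accept their terms on the Hub and verify the schema before use.

```python
# Minimal sketch: keep only English conversations answered by a GPT-4 model.
# Field names ("model", "language") are assumptions based on the dataset cards.
from datasets import load_dataset

ds = load_dataset("allenai/WildChat-1M", split="train")

gpt4_en = ds.filter(
    lambda ex: ex["model"].startswith("gpt-4") and ex["language"] == "English"
)
print(len(gpt4_en))  # should roughly match the 168k GPT-4 English entries noted above
```
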
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
Anthropic/hh-rlhf (harmless-base) | RLHF format, collected with Anthropic's 52B base model, but has many errors and incorrect annotations. | 42.5k | 2 | 2 | There are many errors in the annotations; many "chosen" responses are still not safe | mit |
Anthropic_HH_Golden | RLHF format. Extends the harmless subset of Anthropic/hh-rlhf but rewrites the chosen responses with GPT-4. | 42.5k | 5 | 5 | | apache-2.0 |
nvidia/Aegis-AI-Content-Safety-Dataset-1.0 | Contains prompts, responses, and safety labels. Prompts are from Anthropic's HH-RLHF dataset, and responses are generated by Mistral-7B-v0.1. The human annotation is high quality, but prompts and responses are concatenated without a clear separator. | 10.8k | 5 | 4 | | cc-by-4.0 |