We aim to provide the best references for searching, selecting, and synthesizing high-quality data at scale for post-training your LLMs.
There are three contributions in this repository:
- Data Generation: We provide the data generation process for two important domains: instruction following and function calling.
- Dataset Compilation: We collected and compiled a list of high-quality datasets for post-training LLMs in the domains of instruction following, coding, and math, and provide a quality check for each dataset.
- Dataset Curation: Based on the quality check, we carefully curated a new dataset for post-training LLMs, with each source selected and evaluated for quality and relevance.
Disclaimer: The license information below is taken from the original repositories. We have also noticed that some datasets, although claimed to be open, were actually generated with commercial models. Please double-check carefully before using them, especially if you intend to use them for commercial purposes or similar.
You can download the datasets directly from the Hugging Face Hub. There are two versions:
- Flywheel-v1: a small, highly curated dataset.
- Flywheel-v2: a large and diverse dataset (recommended).
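
A minimal loading sketch with the Hugging Face `datasets` library is shown below; the repository IDs are placeholders, so substitute the actual Flywheel dataset IDs published on the Hub.

```python
# Minimal sketch, assuming the datasets are published as standard Hub datasets.
# "your-org/Flywheel-v1" and "your-org/Flywheel-v2" are placeholder IDs --
# replace them with the actual repository names on the Hugging Face Hub.
from datasets import load_dataset

flywheel_v1 = load_dataset("your-org/Flywheel-v1", split="train")  # small, highly curated
flywheel_v2 = load_dataset("your-org/Flywheel-v2", split="train")  # large and diverse (recommended)

print(flywheel_v2)     # number of rows and column names
print(flywheel_v2[0])  # inspect a single example
```
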
We provide the data generation process for two important domains: instruction following and function calling.
For the dataset compilation, we apply the following quality checks when selecting datasets:
- Domain: we only consider the following tasks: instruction following, coding, and math. Non-English datasets are not considered.
- Data source: only keep GPT-4-generated data; drop inferior data sources (e.g. gpt-3.5-turbo).
- Popularity: only keep popular datasets (more than 1K downloads).
- Accuracy: randomly sample 20 examples for instruction-tuning datasets and 10 for other domains, check the quality manually, and report the quality signal x / 20 (the fraction of correct samples; see the sketch after this list).
- Relevance Score (1-5):
- 5: Directly corresponds to one of [IFEval*, MTBench, AGIEval*, AlpacaEval, …] (risk of overfitting to the benchmark)
- 4: Generally has an instruction-following format with GPT-4- or human-level responses
- 3: Mostly has an instruction-following format with correct responses
- 2: Has major flaws (e.g. irrelevant) but may still be useful
- 1: Low quality or potentially harmful
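
The following is a minimal sketch of how the accuracy signal can be computed from a manual review; the function names and the boolean labels are illustrative, not part of any released tooling.

```python
# Minimal sketch of the manual accuracy check described above: sample 20 examples
# for instruction-following datasets (10 for other domains), label each one by
# hand, and report the fraction that is correct.
import random

def sample_for_review(dataset, n=20, seed=0):
    """Draw n random examples for manual inspection."""
    rng = random.Random(seed)
    return [dataset[i] for i in rng.sample(range(len(dataset)), n)]

def accuracy_signal(labels):
    """labels: booleans from manual review (True = correct). Returns x / n."""
    return sum(labels) / len(labels)

# Example: 19 out of 20 sampled examples judged correct -> 0.95
print(accuracy_signal([True] * 19 + [False]))
```
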
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
glaiveai/glaive-function-calling-v2 | No duplicates in the first 10 examined. Wide variety of tasks. | 113K | 4.5 | 4.5 | | apache-2.0 |
Salesforce/xlam-function-calling-60k | Answers are function names and parameter lists. Contains functions with ambiguous parameter types and trivial functions. | 60K | 5 | 4.5 | | cc-by-4.0 |
Gorilla OpenFunctions-v2 | JSON-format data on GitHub, no Hugging Face dataset. Uses AST matching to determine whether API calls are correct. | 17K | 5 | 5 | | apache-2.0 |
NousResearch/hermes-function-calling-v1 | The function-calling split contains tool calls and responses, compared to the single-turn split. | 1893 | 4.5 | | | apache-2.0 |
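
As a reference for the format noted above for Salesforce/xlam-function-calling-60k (answers are function names and parameter lists), here is a minimal inspection sketch. The field names (query, tools, answers) follow that dataset's card at the time of writing; verify them against the actual schema before relying on this.

```python
# Minimal sketch: inspect one record of a function-calling dataset.
# Field names are assumptions based on the Salesforce/xlam-function-calling-60k card.
import json
from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

example = ds[0]
print(example["query"])                   # natural-language user request
tools = json.loads(example["tools"])      # available function signatures
answers = json.loads(example["answers"])  # expected calls: name + arguments
for call in answers:
    print(call["name"], call["arguments"])
```
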
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
ise-uiuc/Magicoder-OSS-Instruct-75K | Question 1 gives the task, inputs, constraints, and an example (LeetCode style); question 2 gives a method signature; question 3 gives just a problem description. | 75.2K | 4.5 | 3.5 | | mit |
RLHFlow/CodeUltraFeedback-standard | RLHF format, including chosen and rejected responses. There are 50,156 chosen-rejected pairs in total, with roughly 38.4K unique chosen answers. | 38.4k/50.2k (see notes) | 4 | 4 | Sizes are unique chosen answers and total chosen-rejected pairs, respectively | mit |
codeparrot/apps | Competitive-programming (Codeforces) style prompts with inputs, constraints, examples, and descriptions. Includes separate test cases. A method signature is sometimes provided. Relatively complicated; items too long to check. | 10K | N/A | N/A | | mit |
iamtarun/python_code_instructions_18k_alpaca | Prompts sometimes include supplied examples. Even with supplied examples, models only sometimes give the corresponding output. | 18.6K | 5 | 4 | | N/A |
bigcode/self-oss-instruct-sc2-exec-filter-50k | Final self-alignment training dataset for StarCoder2-Instruct. | 50.7k | | | | odc-by |
theblackcat102/evol-codealpaca-v1 | Similar to ise-uiuc/Magicoder-Evol-Instruct-110K. | 111k | | | | apache-2.0 |
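
Several of the coding datasets above (e.g. RLHFlow/CodeUltraFeedback-standard) are in preference format with chosen and rejected responses. Below is a minimal sketch of converting such pairs into plain SFT examples by keeping only the chosen responses; the chosen/rejected field names and the role/content message structure are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch, assuming "chosen" is a list of {"role": ..., "content": ...}
# messages, as in common preference-format datasets. Verify the actual schema first.
from datasets import load_dataset

ds = load_dataset("RLHFlow/CodeUltraFeedback-standard", split="train")

def to_sft(example):
    messages = example["chosen"]
    prompt = "\n".join(m["content"] for m in messages if m["role"] == "user")
    response = next(m["content"] for m in reversed(messages) if m["role"] == "assistant")
    return {"prompt": prompt, "response": response}

sft_ds = ds.map(to_sft, remove_columns=ds.column_names)
print(sft_ds[0]["prompt"][:200])
```
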
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
meta-math/MetaMathQA | Question 4 does not provide explicit formulas. Original questions are sometimes rewritten to be parameterized. | 395k | 4.75 | 4.5 | | mit |
MathInstruct | Contains 13 source datasets, such as CAMEL math. We examined the first 10 questions: question 2 has answer candidates and a correct answer; questions 3 and 5 do not provide a specific answer; most do not provide specific answers. | 262K | 4.5 | 3 | | mit |
camel-ai/math | Composed of 50K problem-solution pairs obtained using GPT-4. | 50k | 5 | 4.5 | | cc-by-nc-4.0 |
xinlai/Math-Step-DPO-10K | RLHF format, including chosen and rejected responses. Uses step-by-step prompts; initial_reason_steps includes preliminary calculations and hints. | 10.8k | 4.5 | 3.5 | | cc-by-nc-4.0 |
openai/gsm8k | Commonly used in many benchmarks, including the LLM Leaderboard. Answers include <<...>>-formatted calculations. | train 7.47k, test 1.32k | 5 | 4.5 | | mit |
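
Since openai/gsm8k answers embed <<...>> calculator annotations and put the final answer after "####" (as noted in the table above), a minimal parsing sketch may be useful; the "main" config name and the "answer" field follow the dataset card.

```python
# Minimal sketch: strip GSM8K calculator annotations and extract the final answer.
import re
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="train")

def clean_answer(answer: str):
    """Split off the final answer (after '####') and remove <<...>> annotations."""
    rationale, final = answer.split("####")
    rationale = re.sub(r"<<[^>]*>>", "", rationale).strip()
    return rationale, final.strip()

rationale, final = clean_answer(ds[0]["answer"])
print(final)  # numeric final answer for the first training example
```
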
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
Open-Orca/1million-gpt-4 | The FLAN collection augmented by submitting the listed questions to GPT-4. Many questions supply a passage as context. | 1M | 5 | 4 | | N/A |
Open-Orca/SlimOrca | This release provides an efficient means of reaching performance on par with larger slices of the OpenOrca data while only including ~500k GPT-4 completions. Many questions supply a passage as context. | 518k | 5 | 4 | | mit |
teknium/GPT4-LLM-Cleaned | Instruction-following data generated by GPT-4 using Alpaca prompts, separated into a main instruction with an optional input, e.g. "instruction": "what does this code do?", "input": "def function()". | 54.6k | 5 | 4 | | apache-2.0 |
databricks/databricks-dolly-15k | Dolly 2.0 (pairs, English, 15K+ entries): a dataset of human-written prompts and responses, featuring tasks like question answering and summarization. Questions are categorized, e.g. "closed_qa", "classification", "open_qa", and an optional "context" field is sometimes supplied. | 15k | 5 | 4 | | cc-by-sa-3.0 |
allenai/WildChat-1M (GPT4-EN) | 1 million conversations between human users and ChatGPT; 25.53% of the conversations come from the GPT-4 chatbot, the rest from GPT-3.5. Contains accompanying scores/classifications for various categories of harmfulness, e.g. "harassment", "self-harm". Many non-English entries. | 168k | 4 | 5 | Filtered to GPT-4 English entries; size refers to GPT-4 entries only | odc-by |
sablo/oasst2_curated | A filtered and curated dataset taken from the top-scoring OpenAssistant/oasst2 conversations, saved in HF chat format. The result is a high-quality dataset for SFT. | train 4.69k, test 24 | 5 | 4 | Open-ended conversation, human annotated | apache-2.0 |
CollectiveCognition/chats-data-2023-09-22 | Collection of chats between users and ChatGPT, shared by users on the "Collective Cognition" website. Includes ChatGPT-generated conversation titles. | 156 | 4.75 | 4 | Human: after filtering for GPT-4 | mit |
lmsys/lmsys-chat-1m | One million real-world conversations with 25 state-of-the-art LLMs. Includes conversation topics with model tags, language, harmfulness ratings across multiple axes, and PII redaction. Many non-English prompts. | 1M | 4.5 | 4 | Human: after filtering for GPT-4 | LMSYS-Chat-1M Dataset License |
teknium/GPTeacher-General-Instruct | GPT-4-generated self-instruct dataset. A mix of open/closed QA, rewriting, and answering questions based on a supplied passage. | 89.3k | 4.5 | 4 | GPT-4 generated | mit |
stingning/ultrachat | Some of the 774K examples are very long, exceeding 10,000 characters. Questions and responses are combined into one field. | 774k | 4.5 | 4 | Human: each dialogue is a list of strings, ChatGPT-generated with human refinements | mit |
jondurbin/airoboros-3.2 | Modified self-instruct with GPT-4. Contains some harmful/toxic content. | 58,709 | 4.5 | 4 | Accuracy: errors in mathematical calculations. Data was generated primarily with GPT-4 | cc-by-4.0 |
openbmb/UltraInteract_sft | A large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. | 289K | 4 | 5 | Specifically for reasoning | mit |
AutoIF | Synthetic dataset that matches IFEval; no open-source download available. Restrictions on output format and length, e.g. 50 words, 5 sentences, 4-syllable words, palindromes. Strong emphasis on conciseness. | N/A | N/A | N/A | Hacks IFEval to generate data | apache-2.0 |
WizardLM/WizardLM_evol_instruct_V2_196k | Original WizardLM data. | 143k | 4.5 | 3 | Human: some errors; gpt-3.5-turbo generated; evolved from a mixture of Alpaca and ShareGPT | mit |
TIGER-Lab/WebInstructSub | Instruction data mined from the web corpus, which contains vast amounts of high-quality instruction data spanning domains like math and science. Specifically contains data from mathstackexchange, stackexchange, and socratic. | 2.34M | 5 | 3 | Human: not relevant | apache-2.0 |
allenai/soda | Dialogue dataset covering a wide range of social interactions. | train 1.19M, validation 146k, test 149k | 5 | 3 | Accuracy: discrepancy in the amount of dialogue and conversation data; dialogues contain proper_name information. Human: not GPT-4 level | cc-by-4.0 |
nvidia/Daring-Anteater | Consists of 100k conversations averaging 2.88 model turns each, mostly generated using an NVIDIA proprietary model and Mixtral-8x7B-Instruct-v0.1, with the remaining samples sourced from FinQA, wikitablequestions, and commercially friendly subsets of Open-Platypus. | 99.5k | 5 | 3 | Human: from NVIDIA proprietary models and Mixtral-8x7B-Instruct-v0.1, not GPT-4 | cc-by-4.0 |
yahma/alpaca-cleaned | Cleaned version of Alpaca, GPT_LLM, and GPTeacher (pairs, English), used by several Alpaca/LLaMA-like models. Cleaned to correct hallucinations, merged instructions, empty outputs, empty code examples, instructions to generate images, N/A outputs, wrong answers (?), nonsensical/unclear instructions, and extra escape and control characters. | 52k | 5 | 3 | Should review some of the choices for cleaning data | cc-by-4.0 |
tatsu-lab/alpaca | A dataset generated by text-davinci-003 to enhance language models' ability to follow human instructions (pairs, English, 52K entries, 21.4MB; used by ChatGLM-fine-tune-LoRA and Koala). Contains an instruction field (all unique), an optional input in ~40% of the data, the model output, and a formatted combination following a prompt template. | 52k | 4.5 | 3 | | cc-by-nc-4.0 |
cascip/ChatAlpaca | Uses ChatGPT (gpt-3.5-turbo) to generate follow-up utterances and continue the conversation with ChatGPT. | 20k | 4 | 3 | | apache-2.0 |
philschmid/guanaco-sharegpt-style | Some code content, mostly general conversations. Mostly non-English. | 9.03k | 3 | 3 | Accuracy: many foreign languages. Human: after filtering, a high-quality GPT-4 daily Q&A dataset of about 6K examples, mainly knowledge Q&A, programming questions, and reasoning/calculation, covering Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and other languages | N/A |
andersonbcdefg/gpt4all | Questions from Stack Overflow. Contains HTML tags. | 438k | 3 | 2 | Human: prompts are HTML, coding, and math, not relevant to instruction following | N/A |
OpenAssistant/oasst1 | Data collected on the open-assistant.io website until April 12, 2023. A human-generated, human-annotated assistant-style conversation corpus of 161,443 messages in 35 languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | train 84.4k, val 4.4k | N/A | 4 | Human-level responses; the conversation trees need to be processed to inspect the data | apache-2.0 |
OpenAssistant/oasst2 | Data collected on the open-assistant.io website until Nov 5, 2023. Same type of data as oasst1. Contains message trees, where the initial prompt is the root node with multiple child nodes as different replies, representing different conversation routes. | train 129k, val 6.6k | N/A | 4 | Human-level responses; the conversation trees need to be processed to inspect the data | apache-2.0 |
Salesforce/dialogstudio | "Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI." A variety of dialogues, including knowledge-grounded dialogues, natural-language understanding, open-domain dialogues, task-oriented dialogues, dialogue summarization, and conversational recommendation. | 3,994,204 | 0 | 2 | Focused on conversational AI, irrelevant. Accuracy: cannot be viewed online, must be downloaded locally first; dialogues are organized as turn lists with much auxiliary information | apache-2.0 |
argilla/magpie-ultra-v0.1 | Synthetically generated dataset for supervised fine-tuning using the Llama 3.1 405B-Instruct model together with other Llama models such as Llama-Guard-3-8B and Meta-Llama-3.1-8B-Instruct. Includes synthetic difficulty tags and required-knowledge info. Base instructions generated by Llama-405B, supplementary info by 8B Llama models. | 50k | 4.75 | 3.5 | Llama-3.1-405B generated | llama3.1 |
bigscience/P3 | A wide variety of NLP tasks, including multiple-choice QA, sentiment analysis, and natural language inference. | 122,127,848 | 5 | 3 | Responses are short, mostly 1-2 sentences. A lot of duplicates; substantial additional filtering is probably needed | apache-2.0 |
yizhongw/self_instruct | The Hugging Face dataset also includes P3 and Super-Natural Instructions data. Self-Instruct is a framework that improves a language model's ability to follow natural-language instructions by using the model's own generations to create a large collection of instructional data, without relying on extensive manual annotation. Mostly in prompt-completion format given a passage. | 82.6k | | 3 | Human: not GPT-4 level | apache-2.0 |
meta-llama/Meta-Llama-3.1-8B-Instruct-evals | Contains the Meta evaluation-result details for Meta-Llama-3.1-8B-Instruct, created from 30 evaluation tasks. | 157k | | 2 | Human: not GPT-4 level; Llama 3 generated on benchmarks! | llama3.1 |
mosaicml/instruct-v3 | Each example has a marked source. An aggregate dataset composed of Dolly HH-RLHF (derived from Databricks Dolly), Self-Instruct (Yizhong Wang), and HH (Anthropic Harmless), combined with Competition Math, Duorc, CoT GSM8k, Qasper, Quality, Summ Screen FD, and Spider. A brief prompt template is included with every instruction. | train 56.2k, test 6.81k | | 2 | Not GPT-4 level; irrelevant tasks | cc-by-sa-3.0 |
teknium/OpenHermes-2.5 | Airoboros 2.2 + CamelAI Domain Expert Datasets (Physics, Math, Chemistry & Biology) + ChatBot Arena (GPT-4 Only) + Collective Cognition (09-11-2023) + CoT Alpaca GPT4 + Evol Instruct 70K & 140K + Glaive Code Assistant + GPT4-LLM + GPTeacher + Medical Tasks + MetaMath 40k + SlimOrca 550K + Platypus + ShareGPT (GPT4-Only) + Unnatural Instructions GPT4. | 1M | | | Naive mixture of multiple datasets. Filtering included removal of OpenAI refusals, disclaimers, and "As an AI"-type examples, among others | N/A |
bilexi/Bitext-customer-support-llm-chatbot-training-dataset | The user provides questions, and the response is a prompt-style answer from the assistant. | 26.9k | | 2 | Irrelevant (customer service) | cdla-sharing-1.0 |
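
Several notes above mention keeping only GPT-4 English entries (e.g. for allenai/WildChat-1M and lmsys/lmsys-chat-1m). Here is a minimal filtering sketch; the "model" and "language" field names follow those dataset cards at the time of writing, and the datasets are gated, so accept their terms on the Hub and verify the schema before use.

```python
# Minimal sketch: keep only English conversations answered by a GPT-4 model.
# Field names ("model", "language") are assumptions based on the dataset cards.
from datasets import load_dataset

ds = load_dataset("allenai/WildChat-1M", split="train")

gpt4_en = ds.filter(
    lambda ex: ex["model"].startswith("gpt-4") and ex["language"] == "English"
)
print(len(gpt4_en))  # should roughly match the 168k GPT-4 English entries noted above
```
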
Name | Description | Quantity | Accuracy | Relevance | Notes for Quality | License |
---|---|---|---|---|---|---|
Anthropic/hh-rlhf (harmless-base) | RLHF format, collected with Anthropic's 52B base model, but has many errors and incorrect annotations. | 42.5k | 2 | 2 | There are many errors in the annotations; many "chosen" responses are still not safe | mit |
Anthropic_HH_Golden | RLHF format. Extends the harmless subset of Anthropic/hh-rlhf but rewrites the chosen responses with GPT-4. | 42.5k | 5 | 5 | | apache-2.0 |
nvidia/Aegis-AI-Content-Safety-Dataset-1.0 | Contains prompts, responses, and safety labels. Prompts are from Anthropic's HH-RLHF dataset, and responses are generated by Mistral-7B-v0.1. The human annotation is high quality, but prompts and responses are concatenated without a clear separator. | 10.8k | 5 | 4 | | cc-by-4.0 |