Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological changes. Recently, with the emergence of open-source large model frameworks like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies. Training LLMs as a small organization or an individual has become an important interest in the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. In addition to large model frameworks, large-scale, high-quality training corpora are also essential for training large language models. Currently, the relevant open-source corpora in the community are still scattered. Therefore, the goal of this repository is to continuously collect high-quality training corpora for LLMs in the open-source community.
Training a chatbot LLM that can follow human instructions effectively requires access to high-quality datasets that cover a range of conversation domains and styles. In this repository, we provide a curated collection of datasets specifically designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.
- IFT: Instruction Finetune
- DFT: Dialog Finetune
- PT: Pretrain
- CoT: Chain-of-Thought Finetune
- RLHF: reward model training for Reinforcement Learning from Human Feedback
Dataset name | Used by | Used for | Language | Size | Description |
---|---|---|---|---|---|
h2oai/h2ogpt-fortune2000-personalized | h2ogpt | IFT | English | 11363 entries | An instruction-finetuning dataset developed by h2oai, covering various topics. |
SHP | StableVicuna, chat-opt, SteamSHP | RLHF | English | 385K entries | An RLHF dataset. Unlike the other RLHF datasets listed here, it uses scores and timestamps to infer users' preferences. Covers 18 domains; collected by Stanford. |
ELI5 | MiniLM series | FT, RLHF | English | 270K entries | Questions and answers collected from Reddit, including scores. Might be used for RLHF reward model training. |
evol_instruct_70k | WizardLM | IFT | English | | An instruction-finetuning dataset derived from Alpaca-52K, using the evolution method in this paper. |
MOSS SFT data | MOSS | IFT, DFT | Chinese, English | 1.1M entries | A conversational dataset collected and developed by the MOSS team. Every entry has usefulness, loyalty, and harmlessness labels. |
ShareGPT52K | Koala, Stable LLM | IFT | Multilingual | 52K | This dataset comprises conversations collected from ShareGPT, with a specific focus on customized creative conversation. |
GPT-4all Dataset | GPT-4all | IFT | English, might have a translated version | 400k entries | A combination of subsets of OIG, P3, and Stackoverflow. Covers topics such as general QA and customized creative questions. |
COIG | / | IFT | Chinese, code | 200K entries | A Chinese-based dataset. It covers domains such as general-purpose QA, Chinese exams, and code. Its quality is checked by human annotators. |
RedPajama-Data-1T | RedPajama | PT | Primarily English | 1.2T tokens, 5TB | A fully open pretraining dataset that follows LLaMA's method. |
OpenAssistant Conversations Dataset (OASST1) | OpenAssistant | IFT, DFT | Multilingual (English, Spanish, etc.) | 66,497 conversation trees | A large, human-written, human-annotated, high-quality conversation dataset. It aims to make LLMs generate more natural responses. |
Alpaca-COT | Phoenix | IFT, DFT, CoT | English | / | A mixture of many datasets, such as the classic Alpaca dataset, OIG, Guanaco, and some CoT (Chain-of-Thought) datasets like FLAN-CoT. May be handy to use. |
CBook-150K | / | PT, building dataset | Chinese | 150K+ books | A raw Chinese book dataset. Needs a preprocessing pipeline. |
databricks-dolly-15k | Dolly2.0 | IFT | English | 15K+ entries | A dataset of human-written prompts and responses, featuring tasks such as open-domain question-answering, brainstorming, summarization, and more. |
AlpacaDataCleaned | Some Alpaca/LLaMA-like models | IFT | English | / | Cleaned version of Alpaca, GPT_LLM and GPTeacher. |
GPT-4-LLM Dataset | Some Alpaca-like models | IFT, RLHF | English, Chinese | 52K entries each for English and Chinese; 9K unnatural-instruction entries | NOT the dataset used by GPT-4! It is generated by GPT-4 and some other LLMs for better IFT and RLHF. It includes instruction data as well as comparison data in RLHF style. |
GPTeacher | / | IFT | English | 20k entries | A dataset whose targets are generated by GPT-4. It includes many of the same seed tasks as the Alpaca dataset, plus some new tasks such as roleplay. |
HC3 | Koala | RLHF | English, Chinese | 24322 English, 12853 Chinese | A multi-domain, human-vs-ChatGPT comparison dataset. Can be used for reward model training or ChatGPT detector training. |
Alpaca data Download | Alpaca, ChatGLM-finetune-LoRA, Koala | DFT, IFT | English | 52K entries, 21.4MB | A dataset generated by text-davinci-003 to improve language models' ability to follow human instructions. |
OIG OIG-small-chip2 | Pythia-Chat-Base-7B, GPT-NeoXT-Chat-Base-20B, Koala | DFT, IFT | English, code | 44M entries | A large conversational instruction dataset with medium and high quality subsets (OIG-small-chip2) for multi-task learning. |
ChatAlpaca data | / | DFT, IFT | English, Chinese version coming soon | 10k entries, 39.5MB | A dataset that aims to help researchers develop models for instruction-following in multi-turn conversations. |
InstructionWild | ColossalChat | IFT | English, Chinese | 10K entries | An Alpaca-style dataset, but with seed tasks that come from ChatGPT screenshots. |
Firefly(流萤) | Firefly(流萤) | IFT | Chinese | 1.1M entries, 1.17GB | A Chinese instruction-tuning dataset with 1.1 million human-written examples across 23 tasks, but no conversation. |
BELLE 0.5M version 1M version 2M version | BELLE series, Chunhua (春华) | IFT | Chinese | 2.67B in total | A Chinese instruction dataset similar to Alpaca data, constructed by generating answers from seed tasks, but no conversation. |
GuanacoDataset | Guanaco | DFT, IFT | English, Chinese, Japanese | 534,530 entries | A multilingual instruction dataset for enhancing language models' capabilities in various linguistic tasks, such as natural language understanding and explicit content recognition. |
xP3 (and some variants) | BLOOMZ, mT0 | IFT | Multilingual, code | 79M entries, 88GB | An instruction dataset for improving language models' generalization ability, similar to Natural Instruct. |
OpenAI WebGPT | WebGPT's reward model, Koala | RLHF | English | 19,578 pairs | Dataset used in the WebGPT paper. Used for training a reward model in RLHF. |
OpenAI Summarization Comparison | Koala | RLHF | English | ~93K entries, 420MB | A dataset of human feedback that helps train a reward model. The reward model was then used to train a summarization model to align with human preferences. |
Natural Instruction GitHub&Download | tk-instruct series | IFT, evaluation | Multilingual | / | A benchmark of over 1,600 tasks, each with an instruction and a definition, for evaluating and improving language models' multi-task generalization under natural language instructions. |
hh-rlhf on Huggingface | Koala | RLHF | English | 161k pairs, 79.3MB | A pairwise dataset for training reward models in reinforcement learning for improving language models' harmlessness and helpfulness. |
Common Crawl | LLaMA (after some processing) | building datasets, PT | / | / | The most well-known raw dataset; rarely used directly. One possible preprocessing pipeline is CCNet. |
nlp_Chinese_Corpus | / | PT, TF | Chinese | / | A Chinese pretraining corpus. Includes Wikipedia, Baidu Baike, Baidu QA, some forum QA, and news corpora. |
The Pile (V1) | GLM (partly), LLaMA (partly), GPT-J, GPT-NeoX-20B, Cerebras-GPT 6.7B, OPT-175b | PT | Multilingual, code | 825GB | A diverse open-source language modeling dataset consisting of 22 smaller, high-quality datasets that cover many domains and tasks. |
C4 Huggingface dataset TensorFlow dataset | Google T5 Series, LLaMA | PT | English | 305GB | A colossal, cleaned version of Common Crawl's web crawl corpus. Frequently used. |
ROOTS | BLOOM | PT | Multilingual, code | 1.6TB | A diverse open-source dataset consisting of sub-datasets like Wikipedia and StackExchange for language modeling. |
Pushshift reddit paper | OPT-175b | PT | / | / | Raw Reddit data; one possible processing pipeline is described in this paper. |
Gutenberg project | LLaMA | PT | Multilingual | / | A book dataset, mostly novels. Not preprocessed. |
CLUECorpus | / | PT, finetune, evaluation | Chinese | 100GB | A Chinese pretraining corpus sourced from Common Crawl. |
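
To pick datasets by training stage, the table can be filtered on the "Used for" column. The following is a minimal sketch, not part of this repository: it assumes each row sits on a single pipe-delimited line with the six columns above, and the helper names (`parse_row`, `usage_tags`, `datasets_for`) are hypothetical. The sample rows are copied from the tables in this README with abridged descriptions.

```python
from typing import Dict, List, Set

# Column order used by the dataset tables in this README.
COLUMNS = ["Dataset name", "Used by", "Used for", "Language", "Size", "Description"]

# Sample rows copied from the tables (descriptions abridged for brevity).
SAMPLE_ROWS = [
    "MOSS SFT data | MOSS | IFT, DFT | Chinese, English | 1.1M entries | A conversational dataset. |",
    "OpenAI WebGPT | WebGPT's reward model, Koala | RLHF | English | 19,578 pairs | Reward model data. |",
    "finance-alpaca | / | IFT | English | 1.3K entries | An Alpaca-style dataset. |",
    "WebText(Reddit links) | GPT-2 | PT | English | / | Data crawled from Reddit. |",
]


def parse_row(line: str) -> Dict[str, str]:
    """Split one pipe-delimited table row into a column -> value mapping."""
    text = line.strip()
    # Drop at most one optional leading and trailing delimiter pipe,
    # so that genuinely empty cells at the end of a row are preserved.
    if text.startswith("|"):
        text = text[1:]
    if text.endswith("|"):
        text = text[:-1]
    cells = [cell.strip() for cell in text.split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}: {line!r}")
    return dict(zip(COLUMNS, cells))


def usage_tags(row: Dict[str, str]) -> Set[str]:
    """Return the set of 'Used for' tags (IFT, DFT, PT, CoT, RLHF, ...)."""
    return {tag.strip() for tag in row["Used for"].split(",") if tag.strip()}


def datasets_for(rows: List[Dict[str, str]], tag: str) -> List[str]:
    """List dataset names whose 'Used for' cell contains the given tag."""
    return [row["Dataset name"] for row in rows if tag in usage_tags(row)]


CATALOG = [parse_row(line) for line in SAMPLE_ROWS]

if __name__ == "__main__":
    for tag in ("IFT", "RLHF", "PT"):
        print(tag, datasets_for(CATALOG, tag))
```

Tags are matched exactly, so a free-form entry such as "finetune" will not match "IFT"; normalizing such entries is left to the reader.
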
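Row items are the subject: each cell describes how the dataset in that row relates to the dataset in that column. For example, the OIG row reads "OIG contains hh-rlhf".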
| | OIG | hh-rlhf | xP3 | natural instruct | AlpacaDataCleaned | GPT-4-LLM | Alpaca-CoT |
---|---|---|---|---|---|---|---|
OIG | / | contains | overlap | overlap | overlap | overlap | |
hh-rlhf | part of | / | overlap | ||||
xP3 | overlap | / | overlap | overlap | |||
natural instruct | overlap | overlap | / | overlap | |||
AlpacaDataCleaned | overlap | / | overlap | overlap | |||
GPT-4-LLM | overlap | / | overlap | ||||
Alpaca-CoT | overlap | overlap | overlap | overlap | overlap | overlap | / |
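
The row-as-subject convention of the relationship table can be sketched as a lookup keyed by (row, column) pairs. This is only an illustration under stated assumptions: the `RELATIONS` mapping holds a small subset of cells taken from the table, the function name `relation` is hypothetical, and inferring a missing cell from its mirrored cell (for example, "part of" from "contains") is an assumption of this sketch rather than something the table states.

```python
from typing import Dict, Optional, Tuple

# (row dataset, column dataset) -> relation, read as "row <relation> column".
# Only a subset of the table's cells is encoded here.
RELATIONS: Dict[Tuple[str, str], str] = {
    ("OIG", "hh-rlhf"): "contains",
    ("hh-rlhf", "OIG"): "part of",
    ("Alpaca-CoT", "OIG"): "overlap",
    ("Alpaca-CoT", "xP3"): "overlap",
}

# Assumption of this sketch: a relation seen from the other side is its inverse.
INVERSE = {"contains": "part of", "part of": "contains", "overlap": "overlap"}


def relation(subject: str, other: str) -> Optional[str]:
    """Return how `subject` relates to `other`, or None if unknown."""
    if subject == other:
        return "/"  # the table marks the diagonal with "/"
    direct = RELATIONS.get((subject, other))
    if direct is not None:
        return direct
    mirrored = RELATIONS.get((other, subject))
    return INVERSE[mirrored] if mirrored is not None else None


if __name__ == "__main__":
    print("OIG", relation("OIG", "hh-rlhf"), "hh-rlhf")
    print("xP3", relation("xP3", "Alpaca-CoT"), "Alpaca-CoT")
```

A lookup like this can help avoid counting the same data twice when mixing corpora, for instance by skipping hh-rlhf when OIG is already included.
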
Dataset name | Used by | Used for | Language | Size | Description |
---|---|---|---|---|---|
finance-alpaca | / | IFT | English | 1.3K entries | An Alpaca-style dataset focused on financial topics. |
Dataset name | Used by | Used for | Language | Size | Description |
---|---|---|---|---|---|
ShareGPT-70K | Vicuna | IFT | / | 70K entries | Data shared by users on ShareGPT. |
WebText(Reddit links) | GPT-2 | PT | English | / | Data crawled from Reddit and filtered for GPT-2 pretraining. |
MassiveText | Gopher, Chinchilla | PT | 99% English, 1% other(including code) | ||
WuDao(悟道) Corpora | GLM | PT | Chinese | 200GB | A large-scale Chinese corpus. Possibly a component was originally open-sourced, but it is not available now. |