A collection of AWESOME datasets for instruction tuning.
There is a trend of improving large language models (LLMs) by fine-tuning them with instructions. As data-centric AI becomes more popular, we need higher-quality datasets to train our models. This repository is a collection of datasets for instruction tuning.
Note that datasets generated by calling the OpenAI API cannot be used to develop models that compete with OpenAI, per section 2(c)(iii) of OpenAI's terms of use.
An awesome list of foundation models can be found here.
-
Size: 52K (English) Source: self-instruct from 175 seed instructions via the OpenAI API Cost: less than US$ 500 License: CC BY-NC 4.0; OpenAI terms of use
-
Size: 104K (English, Chinese) Source: self-instruct from 429 seed instructions collected from the Internet Cost: US$ 880 License: Research only; OpenAI terms of use
-
Self-instruct dataset
Size: 98K (English, Simplified Chinese, Traditional Chinese HK & TW, Japanese) Source: self-instruct from 175 translated seed instructions of Alpaca Dataset Cost: US$ 6K License: GPL-3.0; OpenAI terms of use
Chat dataset
Size: 49K
Close-QA: given a passage and a question, generate the answer.
Size: 99K
Close-Question: given a passage, raise proper questions.
Size: 107K
-
Self-instruct dataset
Size: 1.5M (Chinese) Source: self-instruct from 175 translated seed instructions of Alpaca Dataset License: Research only; OpenAI terms of use
Math dataset
Size: 250K (Chinese)
Multi-turn chat dataset
Size: 800K (Chinese)
-
Size: 75K Source: FLAN Chain-of-Thought dataset
-
Size: 43M Source: Collected by Together, LAION, and Ontocord.ai.
-
Size: 806K Source: filtered from a subset of LAION OIG, StackOverflow Question, and BigScience/P3 data, with answers generated by the OpenAI API.
-
Chat dataset
Size: 107K Source: role-playing between AIs (OpenAI API)
-
Size: 75K (unavailable due to privacy concerns) Source: ShareGPT
-
Size: 52K (Portuguese) Source: translated from Alpaca Data Cost: US$ 8 License: CC BY-NC 4.0; OpenAI terms of use
-
Size: 52K (Japanese) Source: translated from Alpaca Data via the ChatGPT API. Cost: US$ 45 License: CC BY-NC 4.0; OpenAI terms of use
-
Size: 52K (Chinese) Source: translated from Alpaca Data via the ChatGPT API. Cost: US$ 30-45 License: CC BY-NC 4.0; OpenAI terms of use
-
Size: 52K (Chinese) Source: translated from Alpaca Data via the ChatGPT API. Cost: volunteer effort License: CC BY-NC 4.0; OpenAI terms of use