awesome-instruction-tunning-datasets

A collection of AWESOME datasets for instruction tuning.

Fine-tuning large language models (LLMs) on instructions is an increasingly common way to improve them. As data-centric AI gains popularity, we need higher-quality datasets to train our models. This repository collects datasets for instruction tuning.

Note that datasets generated by calling the OpenAI API cannot be used to develop models that compete with OpenAI, per section 2(c)(iii) of OpenAI's terms of use.

An awesome list of foundation models can be found here.

Datasets

Alpaca Dataset

Size: 52K (English)
Source: self-instruct from 175 seed instructions, generated with the OpenAI API
Cost: less than US$ 500
License: CC BY-NC 4.0; OpenAI terms of use

InstructionWild

Size: 104K (English, Chinese)
Source: self-instruct from 429 seed instructions collected from the Internet
Cost: US$ 880
License: Research only; OpenAI terms of use

Guanaco Dataset

Self-instruct dataset

Size: 98K (English, Simplified Chinese, Traditional Chinese HK & TW, Japanese)
Source: self-instruct from 175 translated seed instructions of Alpaca Dataset
Cost: US$ 6K
License: GPL-3.0; OpenAI terms of use

Chat dataset

Size: 49K

Close-QA: given a passage and a question, generate the answer.

Size: 99K

Close-Question: given a passage, generate appropriate questions.

Size: 107K

BELLE

Self-instruct dataset

Size: 1.5M (Chinese)
Source: self-instruct from 175 translated seed instructions of Alpaca Dataset
License: Research only; OpenAI terms of use

Math dataset

Size: 250K (Chinese)

Multi-turn chat dataset

Size: 800K (Chinese)

Alpaca-CoT Dataset

Size: 75K
Source: FLAN Chain-of-Thought dataset

OIG-43M Dataset [preprocess]

Size: 43M
Source: Collected by Together, LAION, and Ontocord.ai.

GPT4All Dataset

Size: 806K
Source: filtered from subsets of LAION OIG, StackOverflow questions, and the BigScience P3 dataset; answers generated by the OpenAI API.

Camel Dataset

Chat dataset

Size: 107K
Source: role-playing between AIs (OpenAI API)

~~Vicuna Dataset~~

Size: 75K (unavailable due to privacy concerns)
Source: ShareGPT

Derivative Datasets

Cabrita Dataset

Size: 52K (Portuguese)
Source: translated from Alpaca Data
Cost: US$ 8
License: CC BY-NC 4.0; OpenAI terms of use

Japanese Alpaca Dataset

Size: 52K (Japanese)
Source: translated from Alpaca Data by ChatGPT API.
Cost: US$ 45
License: CC BY-NC 4.0; OpenAI terms of use

Chinese Alpaca Dataset

Size: 52K (Chinese)
Source: translated from Alpaca Data by ChatGPT API.
Cost: US$ 30-45
License: CC BY-NC 4.0; OpenAI terms of use

Alpaca Chinese Dataset

Size: 52K (Chinese)
Source: translated from Alpaca Data by ChatGPT API.
Cost: Volunteer
License: CC BY-NC 4.0; OpenAI terms of use
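
The OpenAI terms-of-use note near the top of this list can be checked mechanically before picking a dataset. The snippet below is a minimal sketch, not part of any dataset's tooling: `CATALOG` and `split_by_openai_terms` are hypothetical names, and the license strings are hand-copied (lightly normalized) from a few entries in this list. `None` means this list does not state a license for that entry, which is not the same as the dataset being unrestricted, so check the upstream repository in those cases.

```python
# Hypothetical, hand-copied subset of this list: dataset name -> license string.
# None means "no License line in this list", not "no restrictions".
CATALOG = {
    "Alpaca Dataset": "CC BY-NC 4.0; OpenAI terms of use",
    "InstructionWild": "Research only; OpenAI terms of use",
    "Guanaco Dataset": "GPL-3.0; OpenAI terms of use",
    "BELLE": "Research only; OpenAI terms of use",
    "Alpaca-CoT Dataset": None,
    "OIG-43M Dataset": None,
    "Cabrita Dataset": "CC BY-NC 4.0; OpenAI terms of use",
}

OPENAI_CLAUSE = "openai terms of use"


def split_by_openai_terms(catalog):
    """Return (restricted, unstated) lists of dataset names, each sorted.

    restricted: the license string mentions the OpenAI terms of use.
    unstated:   this list gives no license, so it must be checked upstream.
    Entries with a stated license that lacks the clause land in neither list.
    """
    restricted, unstated = [], []
    for name, license_text in catalog.items():
        if license_text is None:
            unstated.append(name)
        elif OPENAI_CLAUSE in license_text.lower():
            restricted.append(name)
    return sorted(restricted), sorted(unstated)


restricted, unstated = split_by_openai_terms(CATALOG)
print("Subject to OpenAI terms of use:", restricted)
print("License not stated in this list:", unstated)
```

Matching on the license string alone is a deliberately narrow check. Some entries above mention the OpenAI API only in their Source line, so a stricter filter would inspect that field as well.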
