awesome-language-dataset

A collection of language datasets for fine-tuning LLMs

A quick repo for downloading all the datasets listed below to fine-tune your LLM. Pull requests are welcome.

| Data | Quantity | Source | Description |
| --- | --- | --- | --- |
| AlpacaGPT3.5Customized | 56K | Link | Generated with GPT-3.5; outputs affected by extreme content filtering were removed and replaced with relevant generated outputs. Designed specifically for training Alpaca-like models |
| Chinese-English Translation dataset | 500K | Link | Sampled and filtered from the original dataset |
| GPT4all (Without P3) | ~440K | Link | Built from Laion OIG unified chip2, Stack Overflow questions, and Bigscience/P3, with the P3 dataset pruned |
| pCLUE dataset | 300K | Link | Sampled and filtered from the original dataset |
| ShareGPT_Vicuna_unfiltered | 48K | Link | ShareGPT conversations |
| Stanford Alpaca dataset (English) | 50K | Link | Original Stanford Alpaca training data |
| Stanford Alpaca dataset (Cantonese) | 50K | To be released | Translated from the Chinese version using the ChatGPT interface |
| Stanford Alpaca dataset (Chinese) | 50K | Link | Translated from the English version using the ChatGPT interface (with some parts discarded) |
| Stanford Alpaca dataset (French) | 50K | Link | Translated from the English version |
| Stanford Alpaca dataset (Japanese) | 50K | Link | Translated from the English version |
| Laion OIG | ~44M | Link | Large instruction dataset of medium quality, along with a smaller high-quality instruction dataset |
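
Once a dataset is downloaded, it usually needs to be turned into prompt/response pairs before fine-tuning. The snippet below is a minimal sketch for the Alpaca-style entries, assuming each record is a JSON object with `instruction`, `input`, and `output` fields (the layout of the original Stanford Alpaca data). The file path passed to `load_records`, the helper names, and the prompt template are illustrative assumptions, not part of this repo; other datasets in the table may use a different schema.

```python
import json


def load_records(path):
    """Load a JSON file that holds a list of Alpaca-style records.

    `path` is a hypothetical local file; point it at whatever you downloaded.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_prompt(record):
    """Build a prompt string from one record.

    The template here is an assumption for illustration; use the template
    your fine-tuning code expects.
    """
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def to_training_pairs(records):
    """Return a list of (prompt, target) tuples, one per record."""
    return [(to_prompt(r), r["output"]) for r in records]


# In-memory sample so the sketch runs without any downloaded file.
sample = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]
pairs = to_training_pairs(sample)
```

For a real file, replace `sample` with `load_records("path/to/your_download.json")`, where the path is whatever location you saved the dataset to.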