Collection of multilingual datasets for fine-tuning LLMs
A quick repo listing the datasets you can download to fine-tune your LLM. Pull requests are welcome.
Data | Quantity | Source | Description |
---|---|---|---|
AlpacaGPT3.5Customized | 56K | Link | Generated with GPT-3.5; refusals caused by overly aggressive content filtering were removed and replaced with relevant generated outputs. Specifically designed for training Alpaca-like models |
Chinese-English translation dataset | 500K | Link | Sampled and filtered from the original dataset |
GPT4All (without P3) | ~440K | Link | Built from LAION OIG unified-chip2, Stack Overflow questions, and BigScience/P3, with the P3 portion pruned out |
pCLUE dataset | 300K | Link | Sampled and filtered from the original dataset |
ShareGPT_Vicuna_unfiltered | 48K | Link | ShareGPT conversations |
Stanford Alpaca dataset (English) | 50K | Link | Original Stanford Alpaca training data |
Stanford Alpaca dataset (Cantonese) | 50K | To be released | Translated from the Chinese version using the ChatGPT interface |
Stanford Alpaca dataset (Chinese) | 50K | Link | Translated from the English version using the ChatGPT interface (with some parts discarded) |
Stanford Alpaca dataset (French) | 50K | Link | Translated from the English version |
Stanford Alpaca dataset (Japanese) | 50K | Link | Translated from the English version |
LAION OIG | ~44M | Link | A large, medium-quality instruction dataset, along with a smaller high-quality instruction dataset |
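
Many of the Alpaca-style datasets above share the same JSON schema: a list of records with `instruction`, `input`, and `output` fields. As a starting point, here is a minimal sketch of turning such a file into prompt/completion pairs for fine-tuning. The file name `alpaca_data.json` is a placeholder for whichever file you download from the links above, and the prompt templates follow the original Stanford Alpaca format.

```python
import json

# Placeholder path; substitute the file you downloaded from one of
# the "Link" sources in the table above.
DATA_PATH = "alpaca_data.json"

# Standard Alpaca prompt templates, with and without an "input" field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def load_alpaca(path: str) -> list[dict]:
    """Load Alpaca-format records and render prompt/completion pairs."""
    with open(path, encoding="utf-8") as f:
        # Expected format: [{"instruction": ..., "input": ..., "output": ...}, ...]
        records = json.load(f)
    pairs = []
    for r in records:
        # Records with an empty "input" use the shorter template.
        template = PROMPT_WITH_INPUT if r.get("input") else PROMPT_NO_INPUT
        pairs.append({
            "prompt": template.format(**r),
            "completion": r["output"],
        })
    return pairs

if __name__ == "__main__":
    pairs = load_alpaca(DATA_PATH)
    print(f"Loaded {len(pairs)} examples")
    print(pairs[0]["prompt"])
```

The same loader should work for the translated Alpaca variants (Chinese, French, Japanese, and so on) as long as they keep the original field names; datasets with other schemas, such as ShareGPT conversations, need their own conversion step.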