awesome-LLM-toolkit

A collection of the tools and datasets needed to build a high-quality, customized LLM.

Core Model/Training Techniques

Data

| Name | Type | Language | License | Note |
| --- | --- | --- | --- | --- |
| mc4 | Raw Text | multilingual (100+) | ODC-By | |
| bloom | Raw Text | multilingual (46) | Depends on data | |
| wmt22 | Translation | multilingual | Depends on data | |
| RedPajama | Raw Text | mostly EN | Apache 2.0 | |
| WuDaoCorpora | Raw Text | zh-CN | | 5TB |
| The Stack | Code | | | |
| The Flan Collection | Instruction | | | |
| openwebtext | Raw Text | EN | CC0 1.0 | Open reproduction of WebText, the dataset used to train GPT-2 |
| self-instruct-seed | Instruction | EN | Apache 2.0 | |
| Stanford Alpaca | Instruction | EN | CC BY-NC 4.0 | |
| Alpaca Cleaned | Instruction | EN | CC BY-NC 4.0 | |
| ShareGPT Vicuna | Instruction | EN | 🤐 | Collected from ShareGPT |
| evol_instruct_70k | Instruction | EN | CC BY-NC ? | Generated by Evol-Instruct |
| HH-RLHF | RLHF | EN | MIT | |
| databricks-dolly-15k | Instruction | EN | CC BY-SA 3.0 | |
| GuanacoDataset | Instruction | EN, zh-CN, zh-TW, JA, DE | GPL 3.0 | Designed for multilingual use |
| dolly_hhrlhf | Instruction | EN | CC BY-SA 3.0 | MosaicML's filtered version of HH-RLHF and databricks-dolly-15k |
| HC3, Chinese | RLHF | EN, CN | CC BY-SA 3.0 | Paper, GitHub |
| Lamini | Instruction | EN | CC-BY-4.0 | |
| the_pile_books3 | Raw Text | mostly EN | MIT | Part of The Pile |
| CoNaLa | Coding | EN | MIT | |
| GPTeacher | Instruction | EN | MIT | Generated by GPT-4 |
| Alpaca-CoT | Instruction | EN, CN | Apache 2.0 | Collection of instruction datasets |
| OpenAssistant/oasst1 | Conversation | multilingual | Apache 2.0 | Paper, collected from Open Assistant |
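
The License column matters when picking data for a commercial model. As a rough sketch, the snippet below copies a handful of rows from the table above and shortlists datasets of a given type whose license is not a non-commercial ("NC") variant. The `Dataset` type, `allows_commercial_use`, and `shortlist` are hypothetical helpers written for this example, and the license check is a simple heuristic, so always read the actual license text before using a dataset.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str
    kind: str      # the "Type" column
    license: str   # the "License" column; empty if none is clearly stated


# A few rows copied from the table above (not the full list).
CATALOG = [
    Dataset("RedPajama", "Raw Text", "Apache 2.0"),
    Dataset("Stanford Alpaca", "Instruction", "CC BY-NC 4.0"),
    Dataset("databricks-dolly-15k", "Instruction", "CC BY-SA 3.0"),
    Dataset("HH-RLHF", "RLHF", "MIT"),
    Dataset("ShareGPT Vicuna", "Instruction", ""),
    Dataset("OpenAssistant/oasst1", "Conversation", "Apache 2.0"),
]


def allows_commercial_use(license_name: str) -> bool:
    """Heuristic (an assumption of this sketch): reject a missing license
    and any license whose name contains the token 'NC'."""
    if not license_name.strip():
        return False
    tokens = license_name.upper().replace("-", " ").split()
    return "NC" not in tokens


def shortlist(catalog, kind):
    """Return names of datasets of the given type that pass the license check."""
    return [d.name for d in catalog if d.kind == kind and allows_commercial_use(d.license)]


print(shortlist(CATALOG, "Instruction"))  # → ['databricks-dolly-15k']
```

Share-alike licenses such as CC BY-SA 3.0 pass this check but still carry obligations of their own, so the shortlist is only a starting point.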

Others

Evaluation

Serving

Community

Others