COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

[ English | 中文 ]

Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need, a high-quality Chinese instruction fine-tuning dataset. It is designed to provide the Chinese NLP community with high-quality instruction fine-tuning data aligned with human interaction.

Project Overview

Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA focuses on building a dataset from Chinese internet sources, including Q&A threads and articles. These sources are thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.

Updates

  • [2023.12.04] 🎉 Released version 0.1 of the dataset, along with SFT models based on Yi-6B-base and Yi-34B-base, fully fine-tuned on v0.1.

Models

Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.

| Model Name | Base Model | Download Link |
| --- | --- | --- |
| CQIA-Yi-6B-v0.1 | Yi-6B-base | Download |
| CQIA-Yi-34B-v0.1 | Yi-34B-base | Download |

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
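The import above only brings in the classes; a minimal loading-and-prompting sketch follows. The model ID passed in and the prompt layout are assumptions for illustration (the checkpoints' official IDs are given by the download links above), and the helper names are hypothetical.

```python
def load_cqia(model_id):
    """Load a CQIA SFT checkpoint with Hugging Face transformers.

    `model_id` is whatever ID the released checkpoint is published under
    (hypothetical here); requires `transformers` and `torch` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy deps kept local
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


def build_prompt(instruction, input_text=""):
    """Join a record's instruction and optional supplementary input into one
    prompt string, mirroring the instruction/input fields of the data format."""
    return instruction if not input_text else f"{instruction}\n{input_text}"
```

With a loaded pair, generation then follows the usual pattern: `model.generate(**tokenizer(build_prompt(...), return_tensors="pt"))`.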

Sample Demonstrations

Logical Reasoning

Input:

Response:

Dataset Details

Data Format

{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {
        "major": ["Q&A"],
        "minor": ["Encyclopedic Q&A"]
    },
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}

Data Fields

  • instruction: The instruction or question posed to the model.
  • input: Supplementary content for the instruction or question.
  • output: The corresponding response.
  • task_type: The main and sub-task types the data belongs to.
  • domain: The field to which the data belongs.
  • answer_from: Whether the response is written by humans or generated by large models (with human verification).
  • human_verified: Indicates if the data has been verified by humans.
  • copyright: Information about the data's copyright, including the author.
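The field list above maps directly onto a simple schema check. The sketch below validates one record using only the standard library; the field names come from the Data Format section, while the helper name and exact checks are illustrative assumptions, not part of the released tooling.

```python
import json

# Fields listed in the Data Format section.
REQUIRED_FIELDS = {
    "instruction", "input", "output", "task_type",
    "domain", "answer_from", "human_verified", "copyright",
}

def validate_record(record):
    """Return a list of problems found in one CQIA record (empty list = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    task_type = record.get("task_type")
    if not (isinstance(task_type, dict) and {"major", "minor"} <= task_type.keys()):
        problems.append("task_type must contain 'major' and 'minor' lists")
    if not isinstance(record.get("human_verified"), bool):
        problems.append("human_verified must be a boolean")
    return problems

# Example record in the documented format.
record = json.loads("""{
    "instruction": "Example question or instruction",
    "input": "Supplementary content",
    "output": "Response to the input",
    "task_type": {"major": ["Q&A"], "minor": ["Encyclopedic Q&A"]},
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information..."
}""")
```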

Dataset Breakdown

Social Media & Forum

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Zhihu | 8837 | [Website] | Multi-stage filtering and human verification. |
| Douban | 3132 | [Website] | Manually written prompt templates. |
| Xiaohongshu | 1508 | [Website] | Manually written prompt templates. |
| Segmentfault | 458 | [Website] | Rule-based cleaning and filtering, followed by manual review. |
| Total | 13935 | - | - |
Encyclopedia

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Encyclopedic Articles | 980 | Collected from the internet [Website] [Website] [Website] [Website] | Rule-based cleaning and filtering, followed by manual review. |
| Encyclopedia of China | 1706 | [Website] | Manually written prompt templates. |
| wikiHow-zh | 1876 | [Website] & [Open Dataset] | Rule-based cleaning and filtering. |
| Total | 4571 | - | - |
General NLP Tasks

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| COIG-PC-Core | 3000 | [Open Dataset] | Manual review of question quality. |
| Total | 3000 | - | - |
Examinations & Quiz

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Chinese National College Entrance Examination & Middle School Entrance Examinations | 2000 | [Open Dataset] | - |
| Nationwide Master's Program Unified Admissions Examination | 475 | Collected from the internet | Rule-based cleaning and filtering. |
| Logical Reasoning | 422 | Collected from the internet | Rule-based cleaning and filtering. |
| Total | 2897 | - | - |
Human Values

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| 100poison | 906 | [Open Dataset] | - |
| COIG-human-value | 101 | [Open Dataset] | Manual review of question quality. |
| Total | 1007 | - | - |
Traditional Chinese Culture

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Traditional Knowledge Quiz | 232 | Collected from the internet | Rule-based cleaning and filtering, followed by manual review. |
| Chinese Idioms | 112 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Classical Chinese Poetry Writing | 47 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Classical Chinese Translation | 112 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Total | 1112 | - | - |
Finance & Economics Management

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| MBA Encyclopedia | 10689 | [Website] | Manually written prompt templates. |
| Finance NLP Tasks | 600 | [Open Dataset] | Manual review of question quality. |
| Total | 12689 | - | - |
Medical

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Medical Encyclopedia | 8351 | [Website] | Manually written prompt templates. |
| Medical Articles | 186 | [Website] [Website] | Rule-based cleaning and filtering. |
| Total | 8537 | - | - |
Law

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Nationwide Master's Program Unified Admissions Examination | 2645 | Collected from the internet | Rule-based cleaning and filtering. |
| Total | 2645 | - | - |
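Once the per-category JSON records are merged, breakdowns like the ones above can be recomputed programmatically. A minimal sketch using only the standard library follows; the `domain` field matches the Data Format section, while the function name and sample records are illustrative assumptions.

```python
from collections import Counter

def domain_counts(records):
    """Count records per domain; a record listing several domains counts in each."""
    counts = Counter()
    for record in records:
        counts.update(record.get("domain", []))
    return counts

# Toy records in the documented format (real data has many more fields).
sample = [
    {"domain": ["Encyclopedia", "Maternal and Child Health"]},
    {"domain": ["Encyclopedia"]},
    {"domain": ["Law"]},
]
```

The same pattern works for `task_type["major"]` or `answer_from` to reproduce the other columns of the breakdown.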

Citation

To cite COIG-CQIA in your work, please use the following format:

@misc{COIG-CQIA,
  author = {},
  title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}

Additional relevant citations:

@article{zhang2023chinese,
  title={Chinese open instruction generalist: A preliminary release},
  author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
  journal={arXiv preprint arXiv:2304.07987},
  year={2023}
}
@misc{Firefly,
  author = {Jianxin Yang},
  title = {Firefly(流萤): 中文对话式大语言模型},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
  title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility}, 
  author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
  year={2023},
  eprint={2307.09705},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}