COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

[ English | 中文 ]

Welcome to the COIG-CQIA project page. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need, a high-quality Chinese instruction fine-tuning dataset. It is designed to provide the Chinese NLP community with high-quality instruction fine-tuning data aligned with human interaction.

Project Overview

Inspired by studies such as LIMA: Less Is More for Alignment, COIG-CQIA focuses on building a dataset from Chinese internet sources, including Q&A threads and articles. These sources are thoroughly cleaned, restructured, and manually reviewed to ensure quality, diversity, and relevance.

Updates

  • [2023.12.04] 🎉 Released version 0.1 of the dataset, along with SFT models based on Yi-6B-base and Yi-34B-base, fully fine-tuned on v0.1.

Models

Leveraging the COIG-CQIA data, we have developed a series of SFT models based on the Yi series.

| Model Name | Base Model | Download Link |
| --- | --- | --- |
| CQIA-Yi-6B-v0.1 | Yi-6B-base | Download |
| CQIA-Yi-34B-v0.1 | Yi-34B-base | Download |

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer
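The import above only brings in the classes; a minimal loading-and-prompting sketch follows. The model ID passed in and the prompt layout are assumptions for illustration (the checkpoints' official IDs are given by the download links above), and the helper names are hypothetical.

```python
def load_cqia(model_id):
    """Load a CQIA SFT checkpoint with Hugging Face transformers.

    `model_id` is whatever ID the released checkpoint is published under
    (hypothetical here); requires `transformers` and `torch` installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy deps kept local
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


def build_prompt(instruction, input_text=""):
    """Join a record's instruction and optional supplementary input into one
    prompt string, mirroring the instruction/input fields of the data format."""
    return instruction if not input_text else f"{instruction}\n{input_text}"
```

With a loaded pair, generation then follows the usual pattern: `model.generate(**tokenizer(build_prompt(...), return_tensors="pt"))`.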

Sample Demonstrations

Logical Reasoning

Input:

Response:

Dataset Details

Data Format

{
    "instruction": "Example question or instruction",
    "input": "Supplementary content for the question or instruction",
    "output": "Response to the input",
    "task_type": {
        "major": ["Q&A"],
        "minor": ["Encyclopedic Q&A"]
    },
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information including author details..."
}

Data Fields

  • instruction: The instruction or question posed to the model.
  • input: Supplementary content for the instruction or question.
  • output: The corresponding response.
  • task_type: The main and sub-task types the data belongs to.
  • domain: The field to which the data belongs.
  • answer_from: Whether the response is written by humans or generated by large models (with human verification).
  • human_verified: Indicates if the data has been verified by humans.
  • copyright: Information about the data's copyright, including the author.
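The field list above maps directly onto a simple schema check. The sketch below validates one record using only the standard library; the field names come from the Data Format section, while the helper name and exact checks are illustrative assumptions, not part of the released tooling.

```python
import json

# Fields listed in the Data Format section.
REQUIRED_FIELDS = {
    "instruction", "input", "output", "task_type",
    "domain", "answer_from", "human_verified", "copyright",
}

def validate_record(record):
    """Return a list of problems found in one CQIA record (empty list = valid)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    task_type = record.get("task_type")
    if not (isinstance(task_type, dict) and {"major", "minor"} <= task_type.keys()):
        problems.append("task_type must contain 'major' and 'minor' lists")
    if not isinstance(record.get("human_verified"), bool):
        problems.append("human_verified must be a boolean")
    return problems

# Example record in the documented format.
record = json.loads("""{
    "instruction": "Example question or instruction",
    "input": "Supplementary content",
    "output": "Response to the input",
    "task_type": {"major": ["Q&A"], "minor": ["Encyclopedic Q&A"]},
    "domain": ["Encyclopedia", "Maternal and Child Health"],
    "answer_from": "human",
    "human_verified": true,
    "copyright": "Copyright information..."
}""")
```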

Dataset Breakdown

Social Media & Forum

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Zhihu | 8837 | [Website] | Multi-stage filtering and human verification. |
| Douban | 3132 | [Website] | Manually written prompt templates. |
| Xiaohongshu | 1508 | [Website] | Manually written prompt templates. |
| Segmentfault | 458 | [Website] | Rule-based cleaning and filtering, followed by manual review. |
| Total | 13935 | - | - |
Encyclopedia

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Encyclopedic Articles | 980 | Collected from the internet [Website] [Website] [Website] [Website] | Rule-based cleaning and filtering, followed by manual review. |
| Encyclopedia of China | 1706 | [Website] | Manually written prompt templates. |
| wikiHow-zh | 1876 | [Website] & [Open Dataset] | Rule-based cleaning and filtering. |
| Total | 4571 | - | - |
General NLP Tasks

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| COIG-PC-Core | 3000 | [Open Dataset] | Manual review of question quality. |
| Total | 3000 | - | - |
Examinations & Quiz

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Chinese National College Entrance Examination & Middle School Entrance Examinations | 2000 | [Open Dataset] | - |
| Nationwide Master's Program Unified Admissions Examination | 475 | Collected from the internet | Rule-based cleaning and filtering. |
| Logical Reasoning | 422 | Collected from the internet | Rule-based cleaning and filtering. |
| Total | 2897 | - | - |
Human Values

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| 100poison | 906 | [Open Dataset] | - |
| COIG-human-value | 101 | [Open Dataset] | Manual review of question quality. |
| Total | 1007 | - | - |
Traditional Chinese Culture

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Traditional Knowledge Quiz | 232 | Collected from the internet | Rule-based cleaning and filtering, followed by manual review. |
| Chinese Idioms | 112 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Classical Chinese Poetry Writing | 47 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Classical Chinese Translation | 112 | [Open Dataset] | Rule-based cleaning and filtering, followed by manual review. |
| Total | 1112 | - | - |
Finance & Economics Management

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| MBA Encyclopedia | 10689 | [Website] | Manually written prompt templates. |
| Finance NLP Tasks | 600 | [Open Dataset] | Manual review of question quality. |
| Total | 12689 | - | - |
Medical

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Medical Encyclopedia | 8351 | [Website] | Manually written prompt templates. |
| Medical Articles | 186 | [Website] [Website] | Rule-based cleaning and filtering. |
| Total | 8537 | - | - |
Law

| Category | Quantity | Source | Construction Method |
| --- | --- | --- | --- |
| Nationwide Master's Program Unified Admissions Examination | 2645 | Collected from the internet | Rule-based cleaning and filtering. |
| Total | 2645 | - | - |
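Once the per-category JSON records are merged, breakdowns like the ones above can be recomputed programmatically. A minimal sketch using only the standard library follows; the `domain` field matches the Data Format section, while the function name and sample records are illustrative assumptions.

```python
from collections import Counter

def domain_counts(records):
    """Count records per domain; a record listing several domains counts in each."""
    counts = Counter()
    for record in records:
        counts.update(record.get("domain", []))
    return counts

# Toy records in the documented format (real data has many more fields).
sample = [
    {"domain": ["Encyclopedia", "Maternal and Child Health"]},
    {"domain": ["Encyclopedia"]},
    {"domain": ["Law"]},
]
```

The same pattern works for `task_type["major"]` or `answer_from` to reproduce the other columns of the breakdown.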

Citation

To cite COIG-CQIA in your work, please use the following format:

@misc{COIG-CQIA,
  author = {},
  title = {COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/paralym/COIG-CQIA}},
}

Additional relevant citations:

@article{zhang2023chinese,
  title={Chinese open instruction generalist: A preliminary release},
  author={Zhang, Ge and Shi, Yemin and Liu, Ruibo and Yuan, Ruibin and Li, Yizhi and Dong, Siwei and Shu, Yu and Li, Zhaoqun and Wang, Zekun and Lin, Chenghua and others},
  journal={arXiv preprint arXiv:2304.07987},
  year={2023}
}
@misc{Firefly,
  author = {Jianxin Yang},
  title = {Firefly(流萤): 中文对话式大语言模型},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/yangjianxin1/Firefly}},
}
@misc{xu2023cvalues,
  title={CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility}, 
  author={Guohai Xu and Jiayi Liu and Ming Yan and Haotian Xu and Jinghui Si and Zhuoran Zhou and Peng Yi and Xing Gao and Jitao Sang and Rong Zhang and Ji Zhang and Chao Peng and Fei Huang and Jingren Zhou},
  year={2023},
  eprint={2307.09705},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}