
BELLE: BE Large Language model Engine (an open-source Chinese conversational large language model)

The goal of this project is to promote the development of an open-source community for Chinese conversational large language models. At this stage, the project builds on BLOOM and LLAMA with optimizations for Chinese, and model fine-tuning uses only data generated by ChatGPT (no other data is included).

This repo contains:

More prompts are welcome via issues!

What's New


Overview

To promote the development of the open-source community for Chinese conversational large language models, this project follows the Stanford Alpaca project, optimizes the Chinese data pipeline, and fine-tunes open-source pre-trained large language models (BLOOM or LLAMA) as base models, producing an open-source model with better support for Chinese instructions - BELLE (BE Large Language model Engine).

We open-source the Alpaca-based data collection code. Using it, we generated about 2 million Chinese samples which, combined with Alpaca's 50,000 English samples, were used to train checkpoints on the BLOOMZ-7B and LLAMA models; the checkpoints are uploaded to Hugging Face.


Limitations and Usage Restrictions

The SFT models trained on the current data and base models still have the following issues:

  1. They may produce factually incorrect answers to instructions that involve factual knowledge.

  2. They cannot reliably identify harmful instructions and may therefore produce harmful content.

  3. Their abilities in scenarios involving reasoning, coding, and the like still need improvement.

Given these limitations, we require that developers use the open-sourced code, data, models, and any derivatives of this project for research purposes only; commercial use and any other use that could harm society are not permitted.


Model Release

We trained models on instruction-tuning datasets of different sizes (200,000, 600,000, 1,000,000, and 2,000,000 samples). The models fine-tuned from BLOOMZ-7B1-mt are now available:

Datasize 200,000 600,000 1,000,000 2,000,000
Finetuned Model BELLE-7B-0.2M BELLE-7B-0.6M BELLE-7B-1M BELLE-7B-2M
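
For reference, the following is a minimal inference sketch using the Hugging Face transformers library. It is not an official script from this repository; the repository id (assumed here to be BelleGroup/BELLE-7B-2M) and the "Human: ... Assistant:" prompt template are assumptions, so please check the model cards on Hugging Face for the exact names and prompt format.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "BelleGroup/BELLE-7B-2M"  # assumed Hugging Face repo id; verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single large GPU
    device_map="auto",          # requires the accelerate package
)

# The "Human: ... \n\nAssistant:" template is an assumption based on the examples below.
prompt = "Human: 推荐几本金庸的武侠小说\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.85, temperature=0.35)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern applies to the other checkpoints listed above.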

In addition, for ease of use, the model has also been quantized with GPTQ; both 4-bit and 8-bit quantized checkpoints are provided.

model name file size GPU memory usage
base 27G ~28.2G
bloom7b-2m-8bit-128g.pt 9.7G ~11.4G
bloom7b-2m-4bit-128g.pt 6.9G ~8.4G
bloom7b-0.2m-8bit-128g.pt 9.7G ~11.4G
bloom7b-0.2m-4bit-128g.pt 6.9G ~8.4G
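
If you do not use the GPTQ checkpoints above, a generic way to cut GPU memory is 8-bit loading through bitsandbytes in transformers. This is not the project's GPTQ pipeline, only a rough alternative sketch with an assumed repository id:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "BelleGroup/BELLE-7B-2M"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_8bit requires the bitsandbytes and accelerate packages and roughly
# halves the weight memory of a float16 load, at some cost in output quality.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)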

We also fine-tuned models on the corresponding datasets based on LLAMA-7B; they are now available:

Datasize 600,000 2,000,000
Finetuned Model BELLE-LLAMA-7B-0.6M BELLE-LLAMA-7B-2M

We will subsequently compare the performance of these models using the method from our earlier work, Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences.


Data Release

  1. zh_seed_tasks.jsonl: contains 175 seed tasks.
  2. 0.5M generated data: to facilitate model training, the data released on Hugging Face merges the "instruction" and "input" fields of the original generated file into a single "input" field, and renames the "output" field to "target" (a small conversion sketch follows this list).
  3. 1M generated data: produced with the same pipeline as the 0.5M dataset; post-processing removed some low-quality samples, e.g. samples in which the model refers to itself as a GPT model, samples the model could not answer because the input was incomplete, and samples whose instruction is in Chinese but whose input or target is in English.
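
As a reference for the field mapping described in item 2, a small conversion sketch is shown below. The exact way the released files join "instruction" and "input" is an assumption here (a simple newline join); inspect the published datasets for the precise format.

def to_release_format(record: dict) -> dict:
    # Original generated record: {"instruction": ..., "input": ..., "output": ...}
    # Released training record:  {"input": ..., "target": ...}
    instruction = record.get("instruction", "").strip()
    extra = record.get("input", "").strip()
    merged = instruction if not extra else instruction + "\n" + extra  # join rule is an assumption
    return {"input": merged, "target": record.get("output", "")}

example = {
    "instruction": "把下面的句子翻译成英语",
    "input": "今天天气怎么样",
    "output": "What's the weather like today?",
}
print(to_release_format(example))
# {'input': '把下面的句子翻译成英语\n今天天气怎么样', 'target': "What's the weather like today?"}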

Data Generation

Following Alpaca's approach:

pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_API_KEY
python generate_instruction.py generate_instruction_following_data

By default, the Completion API with the text-davinci-003 model is used. To use the Chat API with the gpt-3.5-turbo model instead, pass the corresponding arguments:

python generate_instruction.py generate_instruction_following_data \
    --api=chat --model_name=gpt-3.5-turbo

The output is written to Belle.train.json; it can be manually reviewed and filtered before use.
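
Below is a minimal screening sketch for that manual review step. Whether Belle.train.json is a single JSON array or one JSON object per line is not specified here, so both layouts are handled; the filtering rules and the "output" field name are only illustrative assumptions, not the project's official post-processing.

import json

def load_samples(path):
    # Handle both a single JSON array and one JSON object per line.
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def keep(sample):
    # Illustrative checks only: drop empty answers and answers in which the
    # model describes itself as a GPT model or a language model.
    output = str(sample.get("output", "")).strip()
    if not output:
        return False
    if "GPT" in output or "语言模型" in output:
        return False
    return True

samples = load_samples("Belle.train.json")
filtered = [s for s in samples if keep(s)]
print(f"kept {len(filtered)} of {len(samples)} samples")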


Citation

Please cite this project if you use its code, data, or models.

@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: BE Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}

Please also cite the original BLOOM paper, the LLaMA paper, Stanford Alpaca, and the Self-Instruct paper.

Model Usage Examples

Human: 写一篇英文散文诗,主题是春雨,想象自己是春雨,和英国古代诗人莎士比亚交流
Assistant:
I am the spring rain,
Dancing on your face and body.
My dew-drenched petals dance with you,
Bringing life to every plant.

Shakespeare, I am here to meet you.
Your works have filled my heart with joy.
The beauty of nature in its fullness
Makes me feel like a partaker.

As the leaves sway and rustle by,
My dew-filled hand reaches out for them.
The sweet scent of flowers fill the air,
And bring me peace from all around.

Butterflies flit about as well,
Glistening in the sun's rays.
Their wings are made of dew,
A beautiful sight to behold.

You see, Shakespeare, I too am one with nature.
My dew-drenched hair is a symbol of hope.
For when we can connect back to our roots,
We will be able to make it through this season.

Human: 把上面的散文诗翻译成中文
Assistant:
我是一场春雨,
在你脸上跳跃。
我浸润着露珠的翅膀和你一起舞动,
为每一株植物带来生命。

莎士比亚,我来与你相遇。
你的作品填满了我的心房。
大自然的美丽让我感到与之一起分享,
让我觉得自己成为了其中的一部分。

当树叶摇曳时,
我的手掌沾满了露水。
花香弥漫在空气中,
带给我宁静和安宁。

你看,莎士比亚,我也是与大自然融为一体的。
我沾湿的头发象征着希望。
当我们能够回到我们的根源,
我们就能共同度过这个季节。

Human: Classify the sentiment of the following sentence into Positive, Neutral, or Negative:
Super excited about teaching Stanford’s first course on Large Language Models! Check the syllabus out here
Assistant: Positive

Human: 小明的爸爸有三个孩子,老大叫王一,老二叫王二,老三叫什么?
Assistant: 小明

Human: 使用python写一个二分查找的代码
Assistant:
def binary_search(arr, target):
    # iterative binary search over a sorted list
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif target < arr[mid]:
            high = mid - 1  # target is in the lower half
        else:
            low = mid + 1   # target is in the upper half
    return -1

Human: 今天天气怎么样,把这句话翻译成英语
Assistant: What's the weather like today?

Human: 怎么让自己精力充沛,列5点建议
Assistant: 以下是一些可以让你保持精力充沛的建议:
1. 睡眠充足。确保每天晚上得到充足的睡眠时间。
2. 多运动。进行适度的有氧运动,如慢跑、游泳等,可以帮助你提高能量和耐力。
3. 吃健康食品。选择富含蛋白质的食物,如鸡肉、鱼肉和鸡蛋,以及新鲜的水果和蔬菜。
4. 减少咖啡因的摄入量。尽量避免在白天喝含咖啡因的饮料或吃含有咖啡因的食物。
5. 放松身心。尝试冥想、瑜伽或其他形式的放松活动来减轻压力和焦虑感。

Human: 明天就假期结束了,有点抗拒上班,应该什么办?
Assistant: 面对即将结束的假期和开始工作时的紧张情绪,您可以采取以下措施来帮助您更好地适应:
1. 为自己制定一个清晰的工作计划。这可以帮助您更好地组织时间和任务,并避免在工作中感到混乱。
2. 与同事或朋友交流,分享您的感受和工作压力。这将有助于缓解一些焦虑感,同时也可以获得支持和建议。
3. 在工作日前花些时间放松身心,例如进行瑜伽、冥想等活动。这些方法可以减轻身体和精神上的疲劳感,使您更加精力充沛地迎接挑战。

Human: 父母都姓吴,取一些男宝宝和女宝宝的名字
Assistant:
- 男宝:吴俊杰、吴雨辰、吴子墨、吴思源、吴泽阳、吴宇航、吴博文、吴哲毅。
- 女宝:吴美琪、吴佳怡、吴欣瑜、吴香凝、吴雪柔、吴颖诗、吴语嫣。

Human: 推荐几本金庸的武侠小说
Assistant: 以下是三本我最喜欢的金庸武侠小说的推荐:
1. 《射雕英雄传》 - 一部充满智慧和机智的小说,讲述了郭靖、黄蓉夫妇的故事。
2. 《神雕侠侣》- 这是一部浪漫而感人的故事,讲述了一对情侣小龙女与杨过之间的爱情故事。
3. 《倚天屠龙记》- 这是一个宏伟而壮观的故事,描绘了明教领袖张无忌如何成为一位出色的英雄。



