bgi-llm-promptcblue

Following run_clm.py, we implemented fine-tuning of a causal language model, and took 2nd place on leaderboard A and 2nd place on leaderboard B in the open-source track of the CCKS2023-PromptCBLUE Chinese medical LLM evaluation benchmark.

Training procedure

We performed supervised fine-tuning of all parameters on top of the baichuan-13b base model. (Unlike the usual SFT recipe, we follow the pre-training strategy when computing the loss: the loss is computed over all tokens, not only the response tokens.)
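A minimal sketch of this labeling choice (assuming a standard Hugging Face tokenizer/Trainer pipeline; the function name and max_length are illustrative, not the actual code in fintune.sh): the labels are simply a copy of the input IDs, with no prompt tokens masked out.

def tokenize_example(prompt_text, tokenizer, max_length=2048):
    # Pre-training-style objective: labels mirror input_ids, so the loss is
    # computed on every token (prompt and response alike), not only the response.
    enc = tokenizer(prompt_text, truncation=True, max_length=max_length)
    enc["labels"] = enc["input_ids"].copy()
    return enc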

https://huggingface.co/yourui/bgi-promptcblue-baichuan-13b

The checkpoint at step=50000 was chosen as the final model (max_steps=58920).

Fine-tuning:

chmod 755 ./promptcblue/supervised_finetuning/fintune.sh
./promptcblue/supervised_finetuning/fintune.sh

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 8
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • num_epochs: 2.0
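For reference, a hedged sketch of how these values map onto Hugging Face TrainingArguments; the 8-GPU distributed launch and any additional flags live in fintune.sh and are not reproduced here, and output_dir is illustrative.

from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="output/bgi-promptcblue-baichuan-13b",  # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=42,
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)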

Framework versions

  • Transformers 4.30.2
  • Pytorch 2.0.1+cu118
  • Datasets 2.12.0
  • Tokenizers 0.13.3

Data

Starting from the base PromptCBLUE training data, the training set was expanded to 235k examples; see PromptCBLUE_data for details of the augmentation method. Training data file: file

Data: total: 166779

type training examples
train.json 68900
CMeEE-V2 15000
CMeIE 14291
CHIP-CDN 6000
CHIP-CDEE 1587
IMCS-V2-NER 41765
CHIP-MDCFNPC 0
IMCS-V2-SR 0
IMCS-V2-DAC 0
CHIP-CTC 22962
CHIP-STS 16000
KUAKE-IR 10000
KUAKE-QIC 5000
KUAKE-QQR 0
KUAKE-QTR 24174
MedDG 10000
IMCS-V2-MRG 0

The prompt is constructed as follows:

f"Write a response that appropriately completes the Input.\n\nInput:\n{input}\n\nResponse:\n{target}{LLAMA_EOS_TOKEN}"

Generate

Download the model from https://huggingface.co/yourui/bgi-promptcblue-baichuan-13b and save it under the model directory.
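For example, the checkpoint can be fetched with huggingface_hub (a sketch; the local directory name follows the instruction above):

from huggingface_hub import snapshot_download

# Download the released checkpoint into the local "model" directory.
snapshot_download(repo_id="yourui/bgi-promptcblue-baichuan-13b", local_dir="model")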

To speed up inference, the test data is split into eight shards, each processed on a separate GPU.

chmod 755 ./script/PromptCBLUE_generate/generate_all.sh 
chmod 755 ./script/PromptCBLUE_generate/baichuan/generate.sh

./script/PromptCBLUE_generate/generate_all.sh baichuan
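A hedged sketch of what each per-GPU worker roughly does (illustrative only; the actual logic lives in generate.sh and the scripts it calls, and the shard/file names and generation settings here are hypothetical):

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_shard(shard_path, out_path, model_dir="model", device="cuda:0"):
    # One of eight workers; each processes its own shard on a single GPU.
    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, trust_remote_code=True
    ).to(device).eval()
    with open(shard_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sample = json.loads(line)
            prompt = (
                "Write a response that appropriately completes the Input.\n\n"
                f"Input:\n{sample['input']}\n\nResponse:\n"
            )
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            with torch.no_grad():
                output = model.generate(**inputs, max_new_tokens=512)
            text = tokenizer.decode(
                output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
            )
            fout.write(json.dumps({"input": sample["input"], "target": text}, ensure_ascii=False) + "\n")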