What is the training data format for SFT training?

Question

Closed this issue 2 months ago · 3 comments

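Hello,
I have roughly two questions.
1. Data format
Looking at the incremental pre-training data, the only part that is actually used is the text field.
If my data is in the format { "instruction": "xxx", "input": "xxx", "output": "xx" },
which format should I convert it to so that it works with Chinese-Mixtral-8x7B? An example would be much appreciated!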

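Ref: The dataset format for fine-tuning is similar to the one for pre-training. The dataset file must be in jsonl format, with one JSON object per line. Each object must contain a "text" field in which instruction, input, and output are concatenated according to whatever template you need.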

2. tokenizer
The resource at #tokenizer/Mixtral-8x7B-v0.1-vocab appears to have been deleted. I used a different one instead.

Thank you very much!

Answer 1 · 2024-09-29T06:50:35.000Z

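Hello. First, you should choose a template format for instruction fine-tuning. Chinese-Mixtral-8x7B is a base model, so it can accept any template:

If you want to use the instruction template of an existing model, you can use tokenizer.apply_chat_template to do the concatenation automatically (see the detailed documentation). Example: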

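from transformers import AutoTokenizer

# "Some-Instruction-Model" is a placeholder for any model that ships a chat template
tokenizer = AutoTokenizer.from_pretrained("Some-Instruction-Model")

# data is one record of your dataset with "instruction", "input", and "output" keys
messages = [
    {"role": "system", "content": data["instruction"]},
    {"role": "user", "content": data["input"]},
    {"role": "assistant", "content": data["output"]},
]
# The assistant reply is already in messages, so training text needs no generation prompt
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)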

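If you want to define a custom instruction template, you can try creating a Jinja template (see the detailed documentation) and then use tokenizer.apply_chat_template. Alternatively, write code directly that loops over the records and concatenates the strings. An example of manual concatenation is linked here.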
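The manual-concatenation route can be sketched in plain Python as follows. The template (an Alpaca-style layout), the helper names `build_text` and `convert_records`, the `</s>` end-of-sequence string, and the sample record are all assumptions made for illustration, not the template used by the Chinese-Mixtral-8x7B authors. Any layout should work as long as each output line carries a "text" field.

```python
import json

# Hypothetical Alpaca-style template; swap in whatever layout you prefer.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_text(record, eos_token="</s>"):
    """Concatenate one instruction/input/output record into a single string."""
    text = TEMPLATE.format(
        instruction=record["instruction"].strip(),
        input=record["input"].strip(),
        output=record["output"].strip(),
    )
    # Assumption: appending the EOS string marks where the response ends.
    return text + eos_token


def convert_records(records):
    """Return JSONL lines, each a JSON object with a single "text" field."""
    return [
        json.dumps({"text": build_text(r)}, ensure_ascii=False) for r in records
    ]


# Illustrative record, not taken from any real dataset.
records = [
    {
        "instruction": "Answer the student's question.",
        "input": "Are there animals with three legs?",
        "output": "None have been found so far.",
    },
]
jsonl_lines = convert_records(records)
```

Writing `"\n".join(jsonl_lines)` to a `.jsonl` file would give one JSON object per line, which is the shape described in the Ref note of the question.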

Answer 2 · 2024-09-29T17:12:01.000Z

Perhaps I did not understand what you meant.
Your reply is about inference.
Please take a look at the repro.

1. error
!python data/preprocess_datasets.py --ds_name AD --tokenizer_name_or_path mistralai/Mixtral-8x7B-v0.1

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(

./data/sft/AD/AD-train
./data/sft/AD/encoded-AD-train

./data/sft/AD/AD-dev
./data/sft/AD/encoded-AD-dev

Running tokenization (num_proc=2): 0% 0/6 [00:00<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3438, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 3300, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/content/Chinese-Mixtral-8x7B/data/preprocess_datasets.py", line 41, in <lambda>
    lambda example: tokenizer(example["text"]),
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 277, in __getitem__
    value = self.data[key]
KeyError: 'text'

2. data structure

./data/sft/AD/AD-train.jsonl
The data structure (fields) is as follows:
{"instruction": "如果您是小学老师,请根据学生的描述回答问题。", "input": "老师，有3只脚的动物吗", "output": "目前没发现"}
Clearly, the program requires the data to contain a text field.
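A quick pre-flight check can catch this before tokenization starts. The sketch below is a hypothetical helper (`find_lines_missing_field` is not part of the repository) that scans JSONL content and reports which lines lack the `text` field that `preprocess_datasets.py` indexes.

```python
import json


def find_lines_missing_field(lines, field="text"):
    """Return the 1-based numbers of JSONL lines that lack `field`."""
    missing = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no record
        if field not in json.loads(line):
            missing.append(number)
    return missing


# Illustrative content: the first line mirrors the instruction/input/output
# layout from the report, the second line has the expected "text" field.
sample = [
    '{"instruction": "a", "input": "b", "output": "c"}',
    '{"text": "a b c"}',
]
bad_lines = find_lines_missing_field(sample)
```

To check a real file, an open file handle can be passed in place of the list, for example `find_lines_missing_field(open(path, encoding="utf-8"))`, where `path` is your own dataset file.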
3. error pinpoint
https://github.com/HIT-SCIR/Chinese-Mixtral-8x7B/blob/main/train.py#L195
elif training_args.mode == "sft":
    partial_trainer = partial(
        SFTTrainer,
        max_seq_length=training_args.model_max_length,
        neftune_noise_alpha=training_args.neftune_noise_alpha,
        peft_config=peft_config if peft_args.enable_lora else None,
        dataset_text_field="text",
    )
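4. This is where the problem arises: the program only tokenizes the field named by dataset_text_field, and the data I provided does not meet that requirement.
Following your guidance, I would have to generate text from a template with some tokenizer and then pass it to SFTTrainer as the dataset, which seems rather roundabout. Could it be changed so that it runs directly once I provide a dataset in a conforming format?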

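One more question: since I have not studied the Hugging Face transformers code in depth, I have not found how labels or anything output-related are derived from input_ids. That is, some columns of the provided dataset are fed to the model as input, while others are presumably passed as labels to the trainer's optimizer. I am currently reading the relevant low-level code.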
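For reference on the label question: in causal language modeling there is usually no separate label column. As far as I understand the Hugging Face causal LM models, the labels are a copy of input_ids, the model shifts them by one position internally when computing the loss, and positions whose label is -100 are skipped. The sketch below illustrates this in plain Python. `build_labels` and `next_token_pairs` are hypothetical helpers, the token ids are made up, and masking the prompt is an optional choice rather than something this repository is known to do.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(input_ids, prompt_length=0):
    """Copy input_ids into labels, masking the first prompt_length tokens."""
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


def next_token_pairs(input_ids, labels):
    """Emulate the internal shift: the token at position t predicts label t+1."""
    return [
        (context, target)
        for context, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


input_ids = [1, 315, 9011, 28747, 5592, 2]  # made-up token ids
labels = build_labels(input_ids, prompt_length=3)
pairs = next_token_pairs(input_ids, labels)
```

With `prompt_length=0` every token after the first contributes to the loss, which corresponds to training on the whole concatenated text field.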

Answer 3 · 2024-09-30T08:28:21.000Z

I found the answer in the Hugging Face docs for SFTTrainer. Thank you very much!

/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(

./data/sft/AD/AD-train
./data/sft/AD/encoded-AD-train

./data/sft/AD/AD-dev
./data/sft/AD/encoded-AD-dev