[BUG/Help] Could someone please take a look? After fine-tuning on multi-turn dialogue, prediction fails with "130000 is not in list"

Question

[BUG/Help] Could someone please take a look? After fine-tuning on multi-turn dialogue, prediction fails with "130000 is not in list"

q497629642 opened this issue 2 years ago · 38 comments

q497629642 commented 2 years ago

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

Expected Behavior

No response

Steps To Reproduce

Load the fine-tuned multi-turn dialogue checkpoint with AutoTokenizer and AutoModel, then run prediction with stream_chat.

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

q497629642 commented 2 years ago

I'll give it a try.

Answer 1 · 2023-04-07T03:37:52.000Z

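This is probably because the maintainers pushed a new model and new code to HF last night. Try downloading/pulling everything again. I ran into this problem too.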

Answer 2 · 2023-04-07T03:51:02.000Z

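> This is probably because the maintainers pushed a new model and new code to HF last night. Try downloading/pulling everything again. I ran into this problem too.

I tried that; it still reports the same error.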

Answer 3 · 2023-04-07T04:57:07.000Z

Same problem here, +1.

Answer 4 · 2023-04-07T05:49:03.000Z

You need to update ice_text.model and tokenization_chatglm.py in your checkpoint directory.

Answer 5 · 2023-04-07T05:55:31.000Z

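> You need to update ice_text.model and tokenization_chatglm.py in your checkpoint directory.

Could you be more specific about how to update them? Last night I updated the code and re-downloaded the model, then spent more than four hours training. Is there any way to salvage that?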

Answer 6 · 2023-04-07T05:58:08.000Z

150001 is not in list; reinstalling everything still doesn't work.

Answer 7 · 2023-04-07T06:07:26.000Z

> You need to update ice_text.model and tokenization_chatglm.py in your checkpoint directory.

I updated them; it still reports the error.

Answer 8 · 2023-04-07T06:08:27.000Z

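> Could you be more specific about how to update them? Last night I updated the code and re-downloaded the model, then spent more than four hours training. Is there any way to salvage that?

Replace those two files in the directory you load the model from. Download them from https://huggingface.co/THUDM/chatglm-6b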

Answer 9 · 2023-04-07T06:14:58.000Z

input_ids [20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20005, 85421, 20061, 95898, 20032, 88554, 20061, 97257, 84555, 20032, 85107, 20061, 86268, 20032, 85347, 20061, 91689, 20032, 89768, 20061, 105428, 20032, 85173, 93942, 20061, 90984, 20032, 85173, 90936, 20061, 84703, 85509, 150001, 150004]
inputs 类型#上衣材质#牛仔布颜色#白色风格#简约图案#刺绣衣样式#外套衣款式#破洞
label_ids [20005, 91689, 86561, 87061, 97257, 90984, 20006, 92194, 85173, 84290, 84622, 101549, 83823, 85173, 84290, 103343, 83832, 83912, 85209, 84703, 85509, 84051, 20006, 89418, 98598, 107019, 20006, 84257, 91319, 86069, 94197, 83823, 85173, 92265, 84880, 84131, 83832, 93416, 105428, 86261, 20006, 85594, 107834, 20006, 93412, 125145, 85388, 83823, 150001, 150004]
labels 简约而不简单的牛仔外套,白色的衣身十分百搭。衣身多处有做旧破洞设计,打破单调乏味,增加一丝造型看点。衣身后背处有趣味刺绣装饰,丰富层次感,彰显别样时尚。
I updated everything, but after tokenization there seems to be no mask at all.

Answer 10 · 2023-04-07T06:16:48.000Z

> Replace those two files in the directory you load the model from. Download them from https://huggingface.co/THUDM/chatglm-6b

That still doesn't solve it.

Answer 11 · 2023-04-07T06:17:13.000Z

> I updated everything, but after tokenization there seems to be no mask at all.

My guess is that the tokenization is still the problem.

Answer 12 · 2023-04-07T06:18:29.000Z

> I updated everything, but after tokenization there seems to be no mask at all.

Sorry, you also need to update tokenizer_config.json. 150001 is gmask, which is never displayed when decoding anyway. After the update it should be 130001, because the first 20000 unused image tokens were removed.
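The renumbering described above can be sketched as a simple offset. This is only an illustration, assuming every id in the old vocabulary moves down uniformly by 20000; `shift_legacy_id` and `NUM_IMAGE_TOKENS` are hypothetical names, not part of the ChatGLM code.

```python
# Assumption taken from the comment above: the first 20000 (image) token ids
# were dropped, so every remaining id moves down by the same amount.
NUM_IMAGE_TOKENS = 20000


def shift_legacy_id(old_id: int, offset: int = NUM_IMAGE_TOKENS) -> int:
    """Map a token id from the old vocabulary to the new one (uniform shift assumed)."""
    if old_id < offset:
        raise ValueError(f"{old_id} falls inside the removed image-token range")
    return old_id - offset


# Tail of the input_ids dump posted earlier in this thread.
legacy_tail = [84703, 85509, 150001, 150004]
print([shift_legacy_id(i) for i in legacy_tail])
# prints [64703, 65509, 130001, 130004]
```

If a checkpoint directory still produces ids such as 150001 after the update, that suggests it is still using the old tokenizer files.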

Answer 13 · 2023-04-07T06:20:55.000Z

It works for me now. After updating those files, the tokenizer also has to be loaded from your own checkpoint path (and the cache at .cache/huggingface/modules/transformers_modules/checkpoint-1000/ has to be deleted):
tokenizer = AutoTokenizer.from_pretrained("./output/XXX-chatglm-6b-pt-8-1e-2/checkpoint-1000", trust_remote_code=True)
model = AutoModel.from_pretrained("./output/XXX-chatglm-6b-pt-8-1e-2/checkpoint-1000", trust_remote_code=True).half().cuda()
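The cache cleanup mentioned here can be scripted. A minimal sketch, assuming the cache lives under the path quoted in this comment; `clear_remote_code_cache` is a hypothetical helper, and the default root is built from that path rather than read from any Transformers setting.

```python
import shutil
from pathlib import Path
from typing import Optional


def clear_remote_code_cache(checkpoint_name: str, cache_root: Optional[Path] = None) -> bool:
    """Delete the cached remote-code copy for one checkpoint.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if cache_root is None:
        # Assumed default location, taken from the path quoted in this thread.
        cache_root = Path.home() / ".cache" / "huggingface" / "modules" / "transformers_modules"
    target = cache_root / checkpoint_name
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `clear_remote_code_cache("checkpoint-1000")` before reloading the tokenizer and model would correspond to the manual deletion described above.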

Answer 14 · 2023-04-07T06:23:11.000Z

> Sorry, you also need to update tokenizer_config.json. 150001 is gmask, which is never displayed when decoding anyway. After the update it should be 130001, because the first 20000 unused image tokens were removed.

OK now, that was the problem.

Answer 15 · 2023-04-07T06:23:44.000Z

> Sorry, you also need to update tokenizer_config.json. 150001 is gmask, which is never displayed when decoding anyway. After the update it should be 130001, because the first 20000 unused image tokens were removed.

Thanks, it works now.

Answer 16 · 2023-04-07T06:24:19.000Z

That solves it, thanks!

Answer 17 · 2023-04-07T11:16:17.000Z

I re-downloaded the model and config files; it still reports the error.

Answer 18 · 2023-04-07T11:16:59.000Z

[5, 65421, 61, 67329, 32, 98339, 61, 72043, 32, 65347, 61, 70872, 32, 69768, 61, 68944, 32, 67329, 64103, 61, 96914, 0, 0, 5, 87052, 96914, 81471, 64562, 65759, 64493, 64988, 6, 65840, 65388, 74531, 63825, 75786, 64009, 63823, 65626, 63882, 64619, 65388, 6, 64480, 65604, 85646, 110945, 10, 64089, 65966, 87052, 67329, 65544, 6, 71964, 70533, 64417, 63862, 89978, 63991, 63823, 77284, 88473, 64219, 63848, 112012, 6, 71231, 65099, 71252, 66800, 85768, 64566, 64338, 100323, 75469, 63823, 117317, 64218, 64257, 64051, 74197, 6, 63893, 0]
This was run on the default AdvertiseGen dataset.

Answer 19 · 2023-04-07T13:37:27.000Z

I downloaded the latest weight files, but running evaluate.sh still fails. It is probably because 150001 was not changed to 130001.

input_ids [20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20003, 20005, 85421, 20061, 95898, 20032, 88554, 20061, 97257, 84555, 20032, 85107, 20061, 86268, 20032, 85347, 20061, 91689, 20032, 89768, 20061, 105428, 20032, 85173, 93942, 20061, 90984, 20032, 85173, 90936, 20061, 84703, 85509, 150001, 150004]
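The wording of the error in this thread has the same shape as the `ValueError` that Python raises from `list.index` when a value is missing. The sketch below assumes (this thread does not confirm it) that the model code looks up the mask token id in `input_ids` in a similar way; `find_token` is a hypothetical helper.

```python
def find_token(input_ids: list, token_id: int) -> int:
    """Return the position of token_id; list.index raises ValueError if it is absent."""
    return input_ids.index(token_id)


# Tail of the input_ids dump above, still using the old numbering.
legacy_tail = [84703, 85509, 150001, 150004]

for wanted in (150001, 130001):
    try:
        print(wanted, "found at position", find_token(legacy_tail, wanted))
    except ValueError as err:
        print("ValueError:", err)
# prints:
# 150001 found at position 2
# ValueError: 130001 is not in list
```

Under that assumption, a tokenizer that emits old ids while the model code expects the new ones (or the reverse) would fail in exactly this way.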

Answer 20 · 2023-04-07T14:45:21.000Z

> This was run on the default AdvertiseGen dataset.

Update ice_text.model.

Answer 21 · 2023-04-07T14:45:54.000Z

> I downloaded the latest weight files, but running evaluate.sh still fails. It is probably because 150001 was not changed to 130001.

Update ice_text.model, tokenization_chatglm.py, and tokenizer_config.json.

Answer 22 · 2023-04-07T15:55:07.000Z

Sigh, I still have to complain. When the code on HF is updated, the locally copied code is not updated along with it, and sentencepiece then fails to run.

Answer 23 · 2023-04-07T16:10:07.000Z

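Sorry, I am loading the int4-qe model and also get the "150001 is not in list" error. I read the discussion above, but I can't find the three files ice_text.model, tokenization_chatglm.py, and tokenizer_config.json in the chatglm-6b-int4-qe folder.
Where should I make the change? (I deleted this folder and re-downloaded it from HF just an hour ago...)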

Answer 24 · 2023-04-07T16:17:02.000Z

> Update ice_text.model, tokenization_chatglm.py, and tokenizer_config.json.

I re-downloaded all the files today and pulled the code again, but fine-tuning still fails with "130001 is not in list" /(ㄒoㄒ)/~~

Answer 25 · 2023-04-08T02:50:45.000Z

> I can't find the three files ice_text.model, tokenization_chatglm.py, and tokenizer_config.json in the chatglm-6b-int4-qe folder. Where should I make the change?

https://huggingface.co/THUDM/chatglm-6b-int4-qe/tree/main

Answer 26 · 2023-04-08T04:40:32.000Z

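The "130000" error shows up in two situations:
Case 1: it appears when running inference with the pretrained model. The cause is that the pretrained model's supporting files were not updated. Download the latest ice_text.model, tokenization_chatglm.py, and tokenizer_config.json and replace the old ones.
Case 2: it appears when running inference with a fine-tuned model. The cause is that the supporting files saved with the fine-tuned model are the old versions. Copy ice_text.model, tokenization_chatglm.py, and tokenizer_config.json from the pretrained model directory over the corresponding files under output.
I don't know why fine-tuning keeps producing the old versions of these files; deleting the cache under .cache/huggingface/modules/transformers_modules did not help.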
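The copy step from the second case can be sketched as follows. This is a hedged example rather than an official tool: `sync_tokenizer_files` and `TOKENIZER_FILES` are hypothetical names, and the two directories are whatever paths you use for the pretrained model and for the fine-tuned checkpoint under output.

```python
import shutil
from pathlib import Path

# The three files named in this thread.
TOKENIZER_FILES = ("ice_text.model", "tokenization_chatglm.py", "tokenizer_config.json")


def sync_tokenizer_files(pretrained_dir, checkpoint_dir):
    """Copy the tokenizer files from the pretrained model directory into a checkpoint directory."""
    src, dst = Path(pretrained_dir), Path(checkpoint_dir)
    # Check everything first so a missing file does not leave a half-updated checkpoint.
    missing = [name for name in TOKENIZER_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    for name in TOKENIZER_FILES:
        shutil.copy2(src / name, dst / name)  # overwrites the stale copy
    return list(TOKENIZER_FILES)
```

For example, `sync_tokenizer_files("path/to/chatglm-6b", "output/.../checkpoint-1000")`, with both paths being placeholders for your own directories.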

Answer 27 · 2023-04-08T05:32:58.000Z

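@xiaoyaolangzhi Thanks. After doing the cp described in the second case, inference works normally.

Product ad copy is now predicted correctly, but ordinary dialogue such as "tell me about Beijing" has degraded badly. P-tuning still seems to narrow or even degrade the model's abilities; when I use LoRA, the original model's prediction ability is not affected.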

Answer 28 · 2023-04-08T06:08:16.000Z

> I don't know why fine-tuning keeps producing the old versions of these files; deleting the cache under .cache/huggingface/modules/transformers_modules did not help.

@xiaoyaolangzhi The previous tokenizer had a bug when saving its config. It has been fixed; thanks for pointing it out. Checkpoints saved before the fix still need to be updated manually.

Answer 29 · 2023-04-10T09:17:58.000Z

How do I manually update the checkpoint, as mentioned above?

Answer 30 · 2023-04-14T08:03:59.000Z

#596 There used to be a bug: if the data contained [MASK], the tokenizer treated it as a special token, so [gMASK] was not added. This has now been fixed.

Answer 31 · 2023-04-22T00:03:45.000Z

> Case 2: it appears when running inference with a fine-tuned model. [...] Copy ice_text.model, tokenization_chatglm.py, and tokenizer_config.json from the pretrained model directory over the corresponding files under output.

It fails for me at prediction time. I downloaded the latest files, but that had no effect. I can't find tokenization_chatglm.py or tokenizer_config.json under the output folder; only the model folder has them.

Answer 32 · 2023-04-25T15:55:38.000Z

> Sorry, you also need to update tokenizer_config.json. 150001 is gmask, which is never displayed when decoding anyway. After the update it should be 130001, because the first 20000 unused image tokens were removed.

How do I make that change? Do I go into tokenizer_config.json and, in "gmask_token": "[gMASK]", change [gMASK] to 130001?

Answer 33 · 2023-04-26T06:12:43.000Z

Download the latest weight files from https://huggingface.co/THUDM/chatglm-6b ; this issue has been fixed.

Answer 34 · 2023-04-27T12:30:42.000Z

remove add_special_tokens=False

Answer 35 · 2023-05-29T03:49:35.000Z

If you use pipeline, you may need to modify ../python3.9/site-packages/transformers/pipelines/text_generation.py:203

def preprocess(self, prompt_text, prefix="", handle_long_generation=None, **generate_kwargs):
    inputs = self.tokenizer(
        prefix + prompt_text, padding=False, add_special_tokens=True, return_tensors=self.framework
    )

Answer 36 · 2023-08-16T08:27:59.000Z

chatglm2-6b also raises this error.

Answer 37 · 2023-08-17T07:13:50.000Z

What exactly is the situation?
