[BUG/Help] <title>ValueError: 130001 is not in list

Question

WanJuWuGo opened this issue 2 years ago · 8 comments

WanJuWuGo commented 2 years ago

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

When I increase max_steps just slightly during P-tuning, this error occurs. Am I doing something wrong?

Expected Behavior

。

Steps To Reproduce

。

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

。

Answer 1 · 2023-04-13T13:33:04.000Z

Both the code and the model are the latest versions.

Answer 2 · 2023-04-13T14:00:39.000Z

#432
Take a look at that issue; replacing the files inside the output model directory should fix it.

Answer 3 · 2023-04-14T05:03:36.000Z

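This line is the problem:

    mask_token = gMASK if gMASK in input_ids else MASK

input_ids is a whole batch, so as soon as a single example in the batch contains gMASK, mask_token is set to gMASK for every example.

That makes the next line raise the error (suppose one sequence in input_ids contains gMASK and another contains only MASK):

    mask_positions = [seq.tolist().index(mask_token) for seq in input_ids]

Comment out those two lines and write this instead, so the token is chosen per sequence:

    mask_positions = []
    for seq in input_ids:
        mask_token = gMASK if gMASK in seq else MASK
        mask_positions.append(seq.tolist().index(mask_token))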
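To make the failure mode concrete, here is a minimal sketch that reproduces it with plain Python lists instead of tensors. The IDs 130000 for MASK and 130001 for gMASK are assumptions for illustration (130001 is taken from the error message in the title), and both function names are hypothetical. The `any(...)` call stands in for the batch-wide `gMASK in input_ids` check on a tensor.

```python
# Assumed token IDs, for illustration only.
MASK, gMASK = 130000, 130001


def find_mask_positions_batchwise(batch):
    """Mimics the original logic: one mask token is chosen for the whole batch."""
    # On a tensor, `gMASK in input_ids` is true if ANY sequence contains gMASK.
    mask_token = gMASK if any(gMASK in seq for seq in batch) else MASK
    return [seq.index(mask_token) for seq in batch]


def find_mask_positions_per_sequence(batch):
    """Mimics the proposed fix: the mask token is chosen per sequence."""
    positions = []
    for seq in batch:
        mask_token = gMASK if gMASK in seq else MASK
        positions.append(seq.index(mask_token))
    return positions


# One sequence uses gMASK, the other only MASK.
batch = [[5, 6, gMASK, 130004], [7, MASK, 8, 130004]]

try:
    find_mask_positions_batchwise(batch)
except ValueError as err:
    print("batch-wise lookup failed:", err)

print(find_mask_positions_per_sequence(batch))
```

The batch-wise version fails on the second sequence because it searches for gMASK there, while the per-sequence version finds a position in both.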

Answer 4 · 2023-04-14T06:15:31.000Z

The current implementation always uses gMASK. If gMASK is not being used, something has gone wrong.

Answer 5 · 2023-04-14T06:18:00.000Z

If the data itself already contains a mask token, `tokenizer.build_inputs_with_special_tokens(a_ids, b_ids)` will not add gMASK.
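One way to check whether a dataset is affected is to scan the already-tokenized examples for sequences that lack gMASK. The sketch below works on plain lists of token IDs; the ID 130001 is assumed from the error message, and `sequences_missing_gmask` is a hypothetical helper, not part of the repository.

```python
GMASK_ID = 130001  # assumed gMASK ID, taken from the error message


def sequences_missing_gmask(tokenized_examples, gmask_id=GMASK_ID):
    """Return the indices of examples whose token IDs do not contain gMASK."""
    return [
        index
        for index, token_ids in enumerate(tokenized_examples)
        if gmask_id not in token_ids
    ]


# Made-up token IDs: the second example contains 130000 (assumed MASK) but no gMASK.
examples = [
    [11, 12, 130001, 130004, 21],
    [13, 130000, 14, 130004, 22],
]

print(sequences_missing_gmask(examples))
```

Any index reported here points at an example that would trigger the ValueError when the code searches for gMASK.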

Answer 6 · 2023-04-14T06:21:52.000Z

So if the data contains a mask token, does it need to be removed from the data?

Answer 7 · 2023-04-14T06:24:08.000Z

Oh, so that is what is going on. I finally understand which situations trigger this error.

Answer 8 · 2023-04-14T08:00:41.000Z

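This should be fixed now: whether or not the data contains [MASK], the tokenizer appends [gMASK] at the end. Even so, it is still recommended to remove [MASK] from the data. transformers currently provides no option to leave these special tokens unencoded.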
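A minimal sketch of that cleanup step, applied to raw text before tokenization. The helper name `strip_mask_markers` and the record keys `content` and `summary` are hypothetical; adapt them to the columns your dataset actually uses.

```python
def strip_mask_markers(text, markers=("[MASK]",)):
    """Remove literal mask markers from raw text before it is tokenized."""
    for marker in markers:
        text = text.replace(marker, "")
    return text


# Hypothetical training record containing a literal [MASK] marker.
record = {"content": "The weather is [MASK] today", "summary": "sunny"}

cleaned = {key: strip_mask_markers(value) for key, value in record.items()}
print(cleaned)
```

Removing the marker at the text level leaves no mask token ID in the encoded sequence, so the tokenizer's own [gMASK] is the only one present.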