thu-coai/CDial-GPT

Trying to refactor your code, I get IndexError: Target -1 is out of bounds

HarborZeng opened this issue · 6 comments

I am refactoring your code with torch==1.7.0 and transformers==3.5.1, and in the update method the line

(lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)

hits this error. Switching the environment back to torch==1.4.0 and transformers==2.1.1 makes the problem go away, so it is presumably a version issue, but I don't know how to fix it.

The full error output is:

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
['[CLS] [speaker1] 王 雁 盟 [speaker2] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[CLS] [speaker1] 大 话 西 游 之 月 光 宝 盒 主 演 [speaker2] 罗 家 英 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2]', '[CLS] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker1] [speaker2] [speaker2] [speaker2] [speaker2] [speaker2] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]']
['[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 1 9 9 6 年 , 台 湾 计 算 机 程 序 设 计 师 王 雁 盟 到 欧 洲 旅 游 , 在 布 拉 格 街 头 他 为 街 头 艺 人 的 手 风 琴 演 奏 所 着 迷 。 于 是 在 第 二 年 , 他 拜 巴 黎 手 风 琴 演 奏 家 d o m i n i q u e b o d i n 为 师 , 学 习 手 风 琴 演 奏 技 术 。 1 9 9 8 年 回 台 湾 , 在 街 头 拉 着 他 的 手 风 琴 游 荡 。 之 后 , 他 开 始 为 电 影 、 剧 团 演 出 等 伴 奏 手 风 琴 。 到 2 0 0 3 年 , 他 为 几 米 的 《 地 下 铁 一 个 音 乐 的 旅 程 》 音 乐 剧 作 曲 与 演 出 。 《 漂 浮 的 手 风 琴 》 是 他 自 己 制 作 、 作 曲 并 演 奏 的 第 一 个 专 辑 。 [SEP]', '[UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] 罗 家 英 [SEP] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]']
Current run is terminating due to exception: Target -1 is out of bounds..
Engine run is terminating due to exception: Target -1 is out of bounds..
Traceback (most recent call last):
  File "/t/main.py", line 40, in <module>
    trainer.run(train_dataloader, max_epochs=2)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
    return self._internal_run()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
    self._handle_exception(e)
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/kingsoft/gang/t/main.py", line 21, in update
    (lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/transformers/modeling_openai.py", line 595, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 962, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/lib/python3.7/site-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target -1 is out of bounds.

Process finished with exit code 1

main.py

from transformers import BertTokenizer, OpenAIGPTLMHeadModel
from dataset import get_dataloader

tokenizer = BertTokenizer.from_pretrained("models/CDial-GPT_LCCC-large", do_lower_case=True)

train_dataloader = get_dataloader(tokenizer)

model = OpenAIGPTLMHeadModel.from_pretrained("models/CDial-GPT_LCCC-large")

from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)

import torch
def update(engine, batch):
  input_ids, token_type_ids, lm_labels = tuple(batch)
  print(tokenizer.batch_decode(input_ids))
  print(tokenizer.batch_decode(token_type_ids))
  print(tokenizer.batch_decode(lm_labels))
  model.train()
  (lm_loss), *_ = model(input_ids, labels=lm_labels, token_type_ids=token_type_ids)
  loss = lm_loss / 64
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
  if engine.state.iteration % 64 == 0:
    optimizer.step()
    optimizer.zero_grad()
  return loss.item(), optimizer.param_groups[0]['lr']


from ignite.engine import create_supervised_trainer

# from torch import nn
# trainer = create_supervised_trainer(model, optimizer, loss_fn=nn.NLLLoss())

from ignite.engine import Engine

trainer = Engine(update)

trainer.run(train_dataloader, max_epochs=2)

dataset.py

import os
from itertools import chain
import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence

SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[speaker1]", "[speaker2]"]
MODEL_INPUTS = ["input_ids", "lm_labels", "token_type_ids"]


class WBDataset(Dataset):

  def __init__(self, data, tokenizer, max_history=15, batch_first=True, lm_labels=True):
    self.data = data
    self.tokenizer = tokenizer
    self.max_history = max_history
    self.pad = tokenizer.pad_token_id
    self.batch_first = batch_first
    self.lm_labels = lm_labels

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    history = self.data[index][-2 * self.max_history:-1]
    resposne = self.data[index][-1]
    return self.process(history, resposne)

  def process(self, history, resposne, with_eos=True):
    bos, eos, speaker1, speaker2 = self.tokenizer.convert_tokens_to_ids(SPECIAL_TOKENS)

    sequence = [[bos]] + history + [resposne + ([eos] if with_eos else [])]
    sequence = [sequence[0]] + [[speaker2 if i % 2 else speaker1] + s
                                for i, s in enumerate(sequence[1:])]
    instance = {
      "input_ids": list(chain(*sequence)),
      "token_type_ids": [bos] + [speaker2 if i % 2 else speaker1 for i, s in enumerate(sequence[1:]) for _ in s],
      "lm_labels": ([-1] * sum(len(s) for s in sequence[:-1])) + [-1] + sequence[-1][1:]
    }

    return instance

  def collate(self, batch):
    input_ids = pad_sequence(
      [torch.tensor(instance["input_ids"][:512], dtype=torch.long) for instance in batch],
      batch_first=self.batch_first, padding_value=self.pad)
    token_type_ids = pad_sequence(
      [torch.tensor(instance["token_type_ids"][:512], dtype=torch.long) for instance in batch],
      batch_first=self.batch_first, padding_value=self.pad)
    labels = pad_sequence(
      [torch.tensor(instance["lm_labels"][:512], dtype=torch.long) for instance in batch],
      batch_first=self.batch_first, padding_value=-1)
    return input_ids, token_type_ids, labels


def get_dataset(tokenizer):
  dataset_cache = "dataset_cache_" + type(tokenizer).__name__
  if os.path.isfile(dataset_cache):
    dataset = torch.load(dataset_cache)
  else:
    import json
    dataset = {
      "train": json.load(open("data/corpus.json"))["conversations"]
    }

    def tokenize(obj):
      if isinstance(obj, str):
        return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj))
      if isinstance(obj, dict):
        return dict((n, tokenize(o)) for n, o in obj.items())
      return list(tokenize(o) for o in obj)

    dataset = tokenize(dataset)
    torch.save(dataset, dataset_cache)
  return dataset


def get_dataloader(tokenizer):
  dataset = get_dataset(tokenizer)
  train_dataset = WBDataset(dataset["train"], tokenizer)
  train_loader = DataLoader(train_dataset, collate_fn=train_dataset.collate, batch_size=2, shuffle=True)
  return train_loader

My guess is that this is about the number used to pad the labels. Take a look at what ignore_index is in the different versions, then check which value we use when padding the data, and change it to match your version.

Where would I set that? My refactored code hasn't defined a loss function yet, and from what I can tell ignore_index is a parameter of the loss function.
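
For reference, a quick way to check where that knob lives and what its default is (checked against torch 1.7.0):

from torch.nn import CrossEntropyLoss

print(CrossEntropyLoss().ignore_index)  # -100 is the default ignore index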

This is most likely because the GPT2LMHeadModel class implements the loss computation differently across versions of the transformers library.

I suggest comparing how the two versions implement the loss computation.

Pay particular attention to the value they assign to ignore_index when computing the loss.

Thank you for the pointer. Following it, I went through the source code and found that the old version (https://github.com/huggingface/transformers/blob/v2.1.1/transformers/modeling_openai.py#L517) uses

loss_fct = CrossEntropyLoss(ignore_index=-1)

while the new version drops the explicit ignore_index=-1 and falls back to the default (https://github.com/huggingface/transformers/blob/v3.5.1/src/transformers/modeling_openai.py#L594):

loss_fct = CrossEntropyLoss()

Looking at the CrossEntropyLoss source shows the default is ignore_index: int = -100. So I changed the second-to-last line of the collate function to padding_value=-100 and, in the process function, changed the labels to "lm_labels": ([-100] * sum(len(s) for s in sequence[:-1])) + [-100] + sequence[-1][1:], i.e. replaced every label padding value with -100.
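
A minimal standalone check of that behaviour (torch 1.7.0; not part of the training code, just a sanity check):

import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(3, 10)            # 3 positions, vocabulary of 10
labels = torch.tensor([4, -100, 7])    # -100 positions are skipped by the default loss
print(CrossEntropyLoss()(logits, labels))   # works

bad_labels = torch.tensor([4, -1, 7])
# CrossEntropyLoss()(logits, bad_labels)    # raises IndexError: Target -1 is out of bounds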

With that change it runs successfully, and I can carry on with the refactoring.

I am not sure whether the new version differs anywhere else when computing the loss, so please take extra care with that while refactoring.
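
For anyone who would rather not touch the data pipeline, a rough alternative sketch (not from this thread, and assuming the tuple output of transformers 3.5.1, where the first element is the logits when no labels are passed) is to keep the -1 masking and compute the shifted LM loss outside the model with an explicit ignore_index=-1:

import torch
from torch.nn import CrossEntropyLoss

# model and optimizer are the ones defined in main.py above
loss_fct = CrossEntropyLoss(ignore_index=-1)  # keep the old -1 masking convention

def update(engine, batch):
  input_ids, token_type_ids, lm_labels = tuple(batch)
  model.train()
  # No labels are passed in, so the model returns logits only; the shift-by-one
  # below mirrors what the library's LM-head loss does internally.
  lm_logits = model(input_ids, token_type_ids=token_type_ids)[0]
  shift_logits = lm_logits[..., :-1, :].contiguous()
  shift_labels = lm_labels[..., 1:].contiguous()
  lm_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  loss = lm_loss / 64
  loss.backward()
  # gradient clipping and the accumulated optimizer step stay as in main.py
  torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
  if engine.state.iteration % 64 == 0:
    optimizer.step()
    optimizer.zero_grad()
  return loss.item(), optimizer.param_groups[0]['lr']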