Alibaba-NLP/ACE

RuntimeError: Found param embeddings.list_embedding_0.model.embeddings.word_embeddings.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.

gly99999 opened this issue · 1 comment

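Hello, sorry to bother you with another problem. I am trying to enable mixed precision for the model. I added the two parameters use_amp and amp_opt_level in train.py, and running it produces the error below. Judging from the error, the embeddings need to be loaded onto the GPU. I then changed the embeddings_storage_mode parameter of the train function in ReinforcementTrainer to gpu, but after reading the code I don't think that is the cause, and with that setting my GPU does not have enough memory anyway.
I also stepped through with the debugger and inspected the model shown in the image below: some of its embeddings are not on the GPU. I am not sure whether that is the cause. If all the embeddings have to be loaded onto the GPU, I expect memory will run out as well. Sorry to trouble you again.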
image
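For anyone who wants to run the same check without a debugger, here is a minimal sketch that lists the parameters still off the GPU. The helper name params_not_on_cuda is hypothetical and not part of ACE or flair. It only assumes the model exposes named_parameters() the way a torch.nn.Module does, and that each parameter has the is_cuda and device attributes of a torch tensor.

```python
def params_not_on_cuda(model):
    """Return (name, device) pairs for every parameter that is not on a CUDA device.

    `model` is expected to behave like a torch.nn.Module: named_parameters()
    yields (name, parameter) pairs, and each parameter has `is_cuda` and `device`.
    """
    return [
        (name, str(param.device))
        for name, param in model.named_parameters()
        if not param.is_cuda
    ]
```

Calling params_not_on_cuda(self.model) just before the amp.initialize call should show whether the parameter named in the error is the only one left on the CPU or whether whole embedding modules are.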

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : 128.0
Traceback (most recent call last):
  File "/home/gly21/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 547, in train
    self.model, optimizer, opt_level=amp_opt_level, loss_scale=128.0
  File "/home/gly21/.conda/envs/gly_ace/lib/python3.7/site-packages/apex/amp/frontend.py", line 358, in initialize
    return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
  File "/home/gly21/.conda/envs/gly_ace/lib/python3.7/site-packages/apex/amp/_initialize.py", line 171, in _initialize
    check_params_fp32(models)
  File "/home/gly21/.conda/envs/gly_ace/lib/python3.7/site-packages/apex/amp/_initialize.py", line 93, in check_params_fp32
    name, param.type()))
  File "/home/gly21/.conda/envs/gly_ace/lib/python3.7/site-packages/apex/amp/_amp_state.py", line 33, in warn_or_err
python-BaseException
    raise RuntimeError(msg)
RuntimeError: Found param embeddings.list_embedding_0.model.embeddings.word_embeddings.weight with type torch.FloatTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you need to provide a model with parameters
located on a CUDA device before passing it no matter what optimization level
you chose. Use model.to('cuda') to use the default device.

Process finished with exit code 1

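I haven't looked into this part. If amp requires all of self.model to be on the GPU, I suggest that once the code has finished encoding the embeddings for all sentences, you delete self.model.embeddings and only then pass the model to amp. It is best not to change embedding_storage_mode to gpu: with that setting the precomputed embeddings of every sentence are kept on the GPU along with the data, and the GPU easily runs out of memory.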

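However, deleting the embeddings may break validation and testing. You will probably need to handle that yourself, for example by backing up self.model.embeddings before deleting it.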
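A rough sketch of the back-up-then-delete idea, written as a context manager. The name temporarily_without is hypothetical and this has not been tested against ACE. It relies only on getattr, delattr and setattr, which also work for submodules of a torch.nn.Module.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_without(obj, attr_name):
    """Remove obj.<attr_name> for the duration of the block, then put it back."""
    backup = getattr(obj, attr_name)  # keep a reference so nothing is lost
    delattr(obj, attr_name)           # the attribute is now invisible to callers
    try:
        yield backup
    finally:
        setattr(obj, attr_name, backup)  # restore for validation and testing


# Possible use in the trainer (an assumption, mirroring the call in the traceback):
#
# with temporarily_without(self.model, "embeddings"):
#     self.model, optimizer = amp.initialize(
#         self.model, optimizer, opt_level=amp_opt_level, loss_scale=128.0
#     )
```

Note that the restored embeddings stay on the CPU and in fp32, since amp never saw them, so whether training, validation and testing still work afterwards would need to be checked.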