NVIDIA/apex

Multiple independent models, only one requires apex.amp, crash in non-amp CPU model

lopuhin opened this issue · 13 comments

I have a use case with a "main" model that is trained with apex.amp at opt_level "O1", and all is fine. But I also have a small supplementary model that does not need mixed precision training and is trained on the CPU. When apex.amp is enabled, training the second model (after the first model has been trained) crashes with:

File "model.py"
  pred_logits = model(logits)
File "venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
  result = self.forward(*input, **kwargs)
File "model.py", in forward
  return self.linear(x)
File "venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
  result = self.forward(*input, **kwargs)
File "venv/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
  return F.linear(input, self.weight, self.bias)
File "venv/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
  return orig_fn(*new_args, **kwargs)
File "venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
  ret = torch.addmm(bias, input, weight.t())
File "venv/lib/python3.6/site-packages/apex/amp/wrap.py", line 21, in wrapper
  args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
File "venv/lib/python3.6/site-packages/apex/amp/utils.py", line 97, in cached_cast
  if cached_x.grad_fn.next_functions[1][0].variable is not x:
AttributeError: 'NoneType' object has no attribute 'next_functions'

This is happening with pytorch 1.3.1 and apex 2ca894d (latest master), and also happened with 82dac9c.

I'm not sure whether this is a bug or whether I'm using apex.amp incorrectly. I see that the docs say amp.initialize should be called only once (which is the case here), but does this mean that all models to be used in the process must be passed? Is there a way around this? In this case the models are completely unrelated, and initializing them at once would be quite inconvenient.

I also created a simple repro. It crashes, but if we remove the amp initialization or move the second model to the GPU, the crash does not happen:

import torch
from apex import amp
from torchvision.models import resnet34
from torch.optim import SGD

device = torch.device('cuda')
model = resnet34()
optimizer = SGD(model.parameters(), lr=1e-2)
model.to(device)

use_amp = True
if use_amp:
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
model(torch.randn(1, 3, 224, 224).to(device))

another_model = resnet34()  # second, unrelated model left on the CPU
output = another_model(torch.randn(1, 3, 224, 224))  # this forward crashes when amp was initialized above
print(output.shape)
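On the question of whether every model has to be passed to amp.initialize: amp.initialize does accept lists of models and optimizers, so one possible (if inconvenient) workaround is to register both in a single call. A minimal sketch, assuming the supplementary model can also be moved to the GPU (names here are placeholders, not from the real project):

import torch
from apex import amp
from torch.optim import SGD
from torchvision.models import resnet34

device = torch.device('cuda')
main_model = resnet34().to(device)
aux_model = resnet34().to(device)   # stand-in for the small supplementary model
main_opt = SGD(main_model.parameters(), lr=1e-2)
aux_opt = SGD(aux_model.parameters(), lr=1e-2)

# amp.initialize returns lists when given lists, so both models share one call
(main_model, aux_model), (main_opt, aux_opt) = amp.initialize(
    [main_model, aux_model], [main_opt, aux_opt], opt_level='O1')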

Same problem in the same scenario. Did you find a solution?

We switched to the O2 opt level, which does not have this issue. Also, mixed precision training has been natively supported in PyTorch since 1.6, which also solves the problem.
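For reference, a minimal sketch of the native route (torch.cuda.amp, PyTorch >= 1.6), reusing the resnet34 setup from the repro above; the loss and data here are made up for illustration:

import torch
import torch.nn.functional as F
from torch.optim import SGD
from torchvision.models import resnet34

device = torch.device('cuda')
model = resnet34().to(device)
optimizer = SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 gradient underflow

x = torch.randn(1, 3, 224, 224, device=device)
target = torch.randint(0, 1000, (1,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():        # only ops inside this scope run in mixed precision
    loss = F.cross_entropy(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# An unrelated CPU model is unaffected, because autocast is a scoped context
# manager rather than a process-wide patch of torch functions.
another_model = resnet34()
print(another_model(torch.randn(1, 3, 224, 224)).shape)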

Same problem here, but we cannot use the O2 opt level because our model does not fully converge at that opt level.

Did anyone solve this apex error?

I solved it by changing apex/amp/utils.py as follows.

# change this line (line 113)
- if cached_x.grad_fn.next_functions[1][0].variable is not x:
# into this
+ if cached_x.grad_fn.next_functions[0][0].variable is not x:
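For context on why people report different working indices: the cached value is the result of a cast, and which slot of grad_fn.next_functions holds the AccumulateGrad node pointing back at the original parameter depends on the PyTorch version. A quick standalone check in plain PyTorch (not apex code):

import torch

w = torch.nn.Parameter(torch.randn(4))
w_half = w.half()                       # analogous to the cast that amp caches
print(w_half.grad_fn)                   # e.g. CopyBackwards or ToCopyBackward0
for i, (fn, _) in enumerate(w_half.grad_fn.next_functions):
    # the slot whose node holds w is the "parent" the check is looking for
    print(i, fn, getattr(fn, 'variable', None) is w)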

This solved my problem, thanks for the advice @zwithz

I got an error like

if cached_x.grad_fn.next_functions[0][0].variable is not x:
AttributeError: 'NoneType' object has no attribute 'variable'

It seems cached_x.grad_fn.next_functions[0][0] is None
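If the hard-coded index is the problem, one possible workaround (my own assumption, not an official apex fix) is to search all next_functions slots and also tolerate a missing grad_fn instead of indexing a fixed position:

# Hypothetical helper for apex/amp/utils.py: returns True if any slot of
# cached_x.grad_fn.next_functions holds the node whose .variable is x.
def cast_has_parent(cached_x, x):
    if cached_x.grad_fn is None:
        return False
    for fn, _ in cached_x.grad_fn.next_functions:
        if fn is not None and getattr(fn, 'variable', None) is x:
            return True
    return False

# the check in cached_cast would then become:
#     if not cast_has_parent(cached_x, x):
#         raise the same error as before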

Be careful adding the fix that @zwithz mentioned. I'm pretty sure it messed up mixed-precision training for me. After removing the fix months later, everything was back to normal.

Then, how did you solve that problem?

Having this issue running https://github.com/SwinTransformer/Transformer-SSL on SWIN-T, using a 3090, with precompiled apex from
pip install apex -f https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py37_cu113_pyt11/download.html
and
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

The fix from the post did allow me to run the training, although I haven't seen any drastic differences so far (fingers crossed). I hope this won't impact the training long term.

The change does seem to prevent the runtime error about x not being the parent of the cached x, in the case where it is training and the check
torch.is_grad_enabled() and x.requires_grad != cached_x.requires_grad comes into play.

Observing the values of cached_x.grad_fn.next_functions[0][0].variable and x on my end, in every case they seemed to be the same and looked somewhat like this:

 Parameter containing:
tensor([ 0.1293,  0.1166,  0.0126,  0.1159,  0.0701, -0.1036, -0.0761,  0.0027,
        -0.1023, -0.0475,  0.1164,  0.0672,  0.1257,  0.0011, -0.0736,  0.0955,
         0.0106,  0.0243, -0.0612,  0.0593, -0.1066,  0.1152,  0.1263,  0.0521,
         0.1124,  0.0876, -0.0551, -0.1252,  0.0190,  0.0906, -0.0148,  0.0121,
         0.1070,  0.0596,  0.1079,  0.0212,  0.0162, -0.0345, -0.0244, -0.0767,
         0.0965,  0.1316,  0.0536,  0.0041, -0.0476, -0.1425, -0.0267, -0.1025,
        -0.1066, -0.0286, -0.0284,  0.0291, -0.1046,  0.1037, -0.1314, -0.0684,
        -0.0548,  0.0089,  0.0597, -0.0380,  0.0225, -0.0342, -0.0568, -0.0202,
         0.0291, -0.1402, -0.1005,  0.1128,  0.0653, -0.0039,  0.0046,  0.0199,
         0.0335, -0.0985, -0.0393, -0.1325, -0.1135, -0.0272, -0.0191,  0.1129,
         0.0249, -0.0234, -0.0040,  0.0806, -0.0437, -0.0270, -0.0290, -0.1164,
        -0.0202, -0.1334, -0.0776, -0.0919,  0.1075, -0.1330,  0.1391,  0.0541],
       device='cuda:0', requires_grad=True) 

 Parameter containing:
tensor([[-0.0170, -0.0239,  0.0477,  ...,  0.0148, -0.0025,  0.0132],
        [ 0.0459, -0.0163, -0.0274,  ...,  0.0240,  0.0403,  0.0145],
        [-0.0264, -0.0373,  0.0041,  ..., -0.0217,  0.0381,  0.0198],
        ...,
        [ 0.0131, -0.0127,  0.0433,  ..., -0.0061,  0.0056, -0.0072],
        [-0.0119,  0.0015,  0.0027,  ...,  0.0111, -0.0128,  0.0144],
        [ 0.0034,  0.0338, -0.0243,  ..., -0.0028, -0.0256,  0.0207]],
       device='cuda:0', requires_grad=True) 

 Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)

They also both had requires_grad=True, so in my case it ends up using the cached x.
I hope that the different index doesn't affect my training.

I changed this line of code

  • if cached_x.grad_fn.next_functions[0][0].variable is not x:
    to
  • if cached_x.grad_fn.next_functions[1][0].variable is not x:
    and it ran successfully (⊙o⊙)…

I got the same error when using one BERT model to embed two sentences. The model always crashes with "AttributeError: 'NoneType' object has no attribute 'next_functions'", no matter what the second sentence is.
Strangely, I can run your simple repro successfully.
To explore the details, I debugged my code and found that it does not enter the if is_nested(x): or if x in cache: branches of apex.amp.utils.cached_cast(cast_fn, x, cache) while processing the first sentence, and the cache parameter keeps growing. However, as soon as it starts on the second sentence, it enters if x in cache: and fails.

We switched to the O2 opt level, which does not have this issue. Also, mixed precision training has been natively supported in PyTorch since 1.6, which also solves the problem.

Thanks, this is useful.