Inplace computation error with XBM
When I attempt to use XBM, I receive the following error. The encoder is a torchvision resnet34. It does use a number of in-place operations, but I am not sure why this is a problem.
Traceback (most recent call last):
File "/home/andrew/siamese-classifier/train.py", line 252, in <module>
train(
File "/home/andrew/siamese-classifier/train.py", line 143, in train
Quaterion.fit(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/quaterion/main.py", line 101, in fit
trainer.fit(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/adamw.py", line 119, in step
loss = closure()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
closure_result = closure()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
self._backward_fn(step_output.closure_loss)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1450, in backward
loss.backward(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 256]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
When I enable anomaly detection, the error changes to
Traceback (most recent call last):
File "/home/andrew/siamese-classifier/train.py", line 252, in <module>
train(
File "/home/andrew/siamese-classifier/train.py", line 143, in train
Quaterion.fit(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/quaterion/main.py", line 101, in fit
trainer.fit(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
self.fit_loop.run()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/adamw.py", line 119, in step
loss = closure()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
closure_result = closure()
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
self._backward_fn(step_output.closure_loss)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1450, in backward
loss.backward(*args, **kwargs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.
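For anyone landing here with the same message: a minimal sketch, in plain PyTorch and independent of Quaterion, of how anomaly detection turns a NaN produced during the backward pass into an error that names `DivBackward0`. The tensors and the function name `nan_in_div_backward` are made up for illustration, and this is not a claim about which operation failed in the training run above.

```python
import torch


def nan_in_div_backward() -> str:
    """Return the error message anomaly detection raises for a 0/0 division."""
    # Hypothetical inputs: numerator and denominator are both exactly zero.
    a = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    try:
        # detect_anomaly checks the outputs of each backward function for NaN.
        with torch.autograd.detect_anomaly():
            loss = (a / b).sum()
            # The gradient of a / b with respect to b is NaN when a == b == 0.
            loss.backward()
    except RuntimeError as err:
        return str(err)
    return ""


message = nan_in_div_backward()
print(message)
```

Without anomaly detection the same backward pass would finish silently and leave NaN in the gradients, which is why the reported error can change once the flag is turned on.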
Hello, @andrewaf1
Can you please provide your environment details and a minimal code snippet to reproduce this situation?
My setup is somewhat similar to the tutorial here https://quaterion.qdrant.tech/tutorials/cars-tutorial.html. I am using triplet loss and the Adam optimizer to train a model that consists of a pretrained Resnet-34 encoder and a skip connection decoder.
I am running on a Debian Google Cloud instance with an NVIDIA GPU. Python 3.9.0 is installed via Poetry.
Hi @andrewaf1,
It was not possible to reproduce the issue on my side. The error you posted is expected when in-place operations are used, but I can train other models with XBM. XBM currently has another issue: its memory usage grows exponentially (hopefully that will be fixed today), but that is not related to your case.
Could you please provide more details for me to try reproducing your case again?
- Can you train the same model without XBM?
- What is the source of pretrained Resnet34?
- Is it trainable or not?
- What are the exact arguments you pass to TripletLoss?
- What are the exact arguments you pass to XBMConfig?
Thanks.
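As background on the in-place error mentioned above: autograd records a version counter for each tensor it saves for the backward pass and raises this RuntimeError if the tensor has been modified in place by the time it is needed. The sketch below is plain PyTorch with made-up tensors and function names; it is not code from Quaterion or XBM.

```python
import torch


def inplace_version_error() -> str:
    """Return the error raised when a tensor saved for backward is modified in place."""
    w = torch.randn(4, 3, requires_grad=True)
    h = w.exp()          # exp() saves its output h for the backward pass
    loss = h.sum()
    h.add_(1.0)          # in-place update bumps h's version counter
    try:
        loss.backward()  # the saved h no longer matches the recorded version
    except RuntimeError as err:
        return str(err)
    return ""


def out_of_place_ok() -> torch.Tensor:
    """Same computation, but the update creates a new tensor instead."""
    w = torch.randn(4, 3, requires_grad=True)
    h = w.exp()
    loss = h.sum()
    h = h + 1.0          # new tensor; the one saved by exp() is untouched
    loss.backward()
    return w.grad


message = inplace_version_error()
grad = out_of_place_ok()
print(message)
```

The first function fails at `loss.backward()`, while the second runs because the out-of-place addition leaves the saved tensor alone.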
So this error goes away when I switch from hard or semihard mining to all... am I misunderstanding when I can use XBM?
As for those details,
- Yes
- Torchvision, but it also occurs when I replaced it with a purely linear model with zero in-place operations.
- Yes, but it still happens when I set it to not trainable.
- mining is hard or semihard, margin=0.2, distance_metric_name=Distance.EUCLIDEAN
- buffer_size=256, start_iteration=1 (the latter is to speed up testing; I have also tried the default value)
Hi @andrewaf1, thanks for the info you provided!
I have now reproduced the bug: it occurs when Euclidean distance is used with XBM. I verified that XBM can be used with Cosine distance without problems. You may want to use XBM with Cosine distance while I fix the bug and release a patch. You may also consider upgrading your Quaterion installation, as we merged the new semi-hard implementation, which has better memory allocation.
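A rough sketch of that workaround, reusing the argument values reported earlier in this thread. The import paths are my assumption about the package layout and should be checked against the installed Quaterion version; the only intended change is the distance metric.

```python
# Assumed import paths; verify them against your installed Quaterion version.
from quaterion.distances import Distance
from quaterion.loss import TripletLoss
from quaterion.train.xbm import XBMConfig

# Same settings as reported above, except for the distance metric.
loss = TripletLoss(
    margin=0.2,
    distance_metric_name=Distance.COSINE,  # was Distance.EUCLIDEAN
    mining="hard",
)

# Values from the report; start_iteration=1 was only used to speed up testing.
xbm_config = XBMConfig(buffer_size=256, start_iteration=1)
```

Cosine and Euclidean distances live on different scales, so a margin tuned for one may need to be revisited for the other.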
Hi @andrewaf1, the fix to this bug was released in v0.1.34. You can update your installation and start using it.
Awesome, thanks! Just started using it.