qdrant/quaterion

Inplace computation error with XBM

Closed this issue · 10 comments

When I attempt to use XBM, I receive the following error. The encoder is a torchvision resnet34. It does use a number of in-place operations, but I am not sure why this is a problem.

Traceback (most recent call last):
  File "/home/andrew/siamese-classifier/train.py", line 252, in <module>
    train(
  File "/home/andrew/siamese-classifier/train.py", line 143, in train
    Quaterion.fit(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/quaterion/main.py", line 101, in fit
    trainer.fit(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/adamw.py", line 119, in step
    loss = closure()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
    self._backward_fn(step_output.closure_loss)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1450, in backward
    loss.backward(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 256]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
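
For readers unfamiliar with this message, here is a minimal sketch in plain PyTorch of how an in-place update can invalidate a tensor that autograd saved for the backward pass. It is unrelated to Quaterion's internals; the function name `backward_after_update` and the toy tensors are made up for illustration.

```python
import torch


def backward_after_update(in_place: bool) -> torch.Tensor:
    """Return d(out)/dw after touching an intermediate tensor (toy example)."""
    w = torch.ones(3, requires_grad=True)
    h = w * 2              # intermediate activation
    out = (h ** 2).sum()   # pow saves h so it can compute its gradient later
    if in_place:
        h.add_(1)          # modifies the saved tensor and bumps its version counter
    else:
        h = h + 1          # creates a new tensor; the saved h is left untouched
    out.backward()
    return w.grad


try:
    backward_after_update(in_place=True)
except RuntimeError as err:
    print("in-place update failed:", str(err)[:75], "...")

print("out-of-place gradient:", backward_after_update(in_place=False))
```

The in-place call fails with the same kind of version-mismatch message as above, while the out-of-place variant runs the backward pass normally.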

When I enable anomaly detection, the error changes to:

Traceback (most recent call last):
  File "/home/andrew/siamese-classifier/train.py", line 252, in <module>
    train(
  File "/home/andrew/siamese-classifier/train.py", line 143, in train
    Quaterion.fit(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/quaterion/main.py", line 101, in fit
    trainer.fit(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1550, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1705, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/optim/adamw.py", line 119, in step
    loss = closure()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
    self._backward_fn(step_output.closure_loss)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1450, in backward
    loss.backward(*args, **kwargs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/andrew/.cache/pypoetry/virtualenvs/imageclassifier-1EEfuA3P-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.

joein commented

Hello, @andrewaf1

Can you please provide your environment details and a minimal code snippet to reproduce this situation?

My setup is somewhat similar to the tutorial here: https://quaterion.qdrant.tech/tutorials/cars-tutorial.html. I am using triplet loss and the Adam optimizer to train a model that consists of a pretrained ResNet-34 encoder and a skip-connection decoder.

I am running on a Debian Google Cloud instance with an NVIDIA GPU. Python 3.9.0 is installed via Poetry.

Any update on this @joein? Any more environment details I can provide?

Hi @andrewaf1,
It was not possible to reproduce the issue on my side. The error you posted is expected when in-place operations are used, but I can train other models with XBM. XBM currently has another issue, its memory usage grows exponentially (hopefully that will be fixed today), but it is not related to your case.

Could you please provide more details for me to try reproducing your case again?

  1. Can you train the same model without XBM?
  2. What is the source of pretrained Resnet34?
  3. Is it trainable or not?
  4. What are the exact arguments you pass to TripletLoss?
  5. What are the exact arguments you pass to XBMConfig?

Thanks.

So this error goes away when I switch from hard or semihard mining to all... am I misunderstanding when I can use XBM?

As for those details,

  1. Yes
  2. Torchvision, but it also occurs when I replaced it with a purely linear model with zero in-place operations.
  3. Yes, but it still happens when I set it to not trainable.
  4. mining is hard or semihard, margin=0.2, distance_metric_name=Distance.EUCLIDEAN
  5. buffer_size=256, start_iteration=1 (the latter is to speed up testing; I have also tried the default value)

Hi @andrewaf1, thanks for the info you provided!
I have now reproduced the bug: it occurs when Euclidean distance is used with XBM. I verified that XBM can be used with Cosine distance without problems. You may want to use XBM with Cosine distance while I fix the bug and release a patch. You may also consider upgrading your Quaterion installation, as we merged the new semi-hard implementation, which has better memory allocation.
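
The thread does not say what exactly went wrong inside Quaterion, so the following is only a generic sketch of one well-known way a Euclidean distance can yield NaN gradients in PyTorch: differentiating the square root at a distance of exactly zero. The helper names and the `eps` stabiliser are assumptions made for illustration; they are not Quaterion parameters and not necessarily how the patch works.

```python
import torch


def euclidean(x: torch.Tensor, y: torch.Tensor, eps: float = 0.0) -> torch.Tensor:
    """Row-wise Euclidean distance; eps is a hypothetical stabiliser."""
    return torch.sqrt(((x - y) ** 2).sum(dim=1) + eps)


def grad_wrt_x(eps: float) -> torch.Tensor:
    x = torch.ones(2, 3, requires_grad=True)
    y = torch.ones(2, 3)  # identical to x, so every distance is exactly 0
    euclidean(x, y, eps).sum().backward()
    return x.grad


print(grad_wrt_x(eps=0.0))   # every entry is nan (inf from sqrt times 0 from the square)
print(grad_wrt_x(eps=1e-8))  # every entry is a finite zero

# With anomaly detection on, the backward pass raises instead of silently producing nan.
with torch.autograd.detect_anomaly():
    try:
        grad_wrt_x(eps=0.0)
    except RuntimeError as err:
        print("anomaly detection:", err)
```

Cosine distance does not take a square root of a possibly zero squared distance, which would be consistent with it being unaffected, though that explanation is not confirmed in this thread.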

Hi @andrewaf1, the fix to this bug was released in v0.1.34. You can update your installation and start using it.

Awesome, thanks! Just started using it.