airctic/icevision

IndexError: tensors used as indices must be long, byte or bool tensors (Retinanet only)

robmarkcole opened this issue · 16 comments

๐Ÿ› Bug

Describe the bug
On icevision 0.4.0, running retinanet on a custom dataset (fire), I get the following:

/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-c232684d32d4> in <module>()
      1 learn.freeze()
----> 2 learn.lr_find()

16 frames
/usr/local/lib/python3.6/dist-packages/fastai/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
    222     n_epoch = num_it//len(self.dls.train) + 1
    223     cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 224     with self.no_logging(): self.fit(n_epoch, cbs=cb)
    225     if show_plot: self.recorder.plot_lr_find()
    226     if suggestions:

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    203             self.opt.set_hypers(lr=self.lr if lr is None else lr)
    204             self.n_epoch = n_epoch
--> 205             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
    206 
    207     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    152 
    153     def _with_events(self, f, event_type, ex, final=noop):
--> 154         try:       self(f'before_{event_type}')       ;f()
    155         except ex: self(f'after_cancel_{event_type}')
    156         finally:   self(f'after_{event_type}')        ;final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_fit(self)
    194         for epoch in range(self.n_epoch):
    195             self.epoch=epoch
--> 196             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
    197 
    198     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    152 
    153     def _with_events(self, f, event_type, ex, final=noop):
--> 154         try:       self(f'before_{event_type}')       ;f()
    155         except ex: self(f'after_cancel_{event_type}')
    156         finally:   self(f'after_{event_type}')        ;final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_epoch(self)
    188 
    189     def _do_epoch(self):
--> 190         self._do_epoch_train()
    191         self._do_epoch_validate()
    192 

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_epoch_train(self)
    180     def _do_epoch_train(self):
    181         self.dl = self.dls.train
--> 182         self._with_events(self.all_batches, 'train', CancelTrainException)
    183 
    184     def _do_epoch_validate(self, ds_idx=1, dl=None):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    152 
    153     def _with_events(self, f, event_type, ex, final=noop):
--> 154         try:       self(f'before_{event_type}')       ;f()
    155         except ex: self(f'after_cancel_{event_type}')
    156         finally:   self(f'after_{event_type}')        ;final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in all_batches(self)
    158     def all_batches(self):
    159         self.n_iter = len(self.dl)
--> 160         for o in enumerate(self.dl): self.one_batch(*o)
    161 
    162     def _do_one_batch(self):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in one_batch(self, i, b)
    176         self.iter = i
    177         self._split(b)
--> 178         self._with_events(self._do_one_batch, 'batch', CancelBatchException)
    179 
    180     def _do_epoch_train(self):

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
    152 
    153     def _with_events(self, f, event_type, ex, final=noop):
--> 154         try:       self(f'before_{event_type}')       ;f()
    155         except ex: self(f'after_cancel_{event_type}')
    156         finally:   self(f'after_{event_type}')        ;final()

/usr/local/lib/python3.6/dist-packages/fastai/learner.py in _do_one_batch(self)
    161 
    162     def _do_one_batch(self):
--> 163         self.pred = self.model(*self.xb)
    164         self('after_pred')
    165         if len(self.yb): self.loss = self.loss_func(self.pred, *self.yb)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/retinanet.py in forward(self, images, targets)
    556 
    557             # compute the losses
--> 558             losses = self.compute_loss(targets, head_outputs, anchors)
    559         else:
    560             # compute the detections

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/retinanet.py in compute_loss(self, targets, head_outputs, anchors)
    406             matched_idxs.append(self.proposal_matcher(match_quality_matrix))
    407 
--> 408         return self.head.compute_loss(targets, head_outputs, anchors, matched_idxs)
    409 
    410     def postprocess_detections(self, head_outputs, anchors, image_shapes):

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/retinanet.py in compute_loss(self, targets, head_outputs, anchors, matched_idxs)
     49         # type: (List[Dict[str, Tensor]], Dict[str, Tensor], List[Tensor], List[Tensor]) -> Dict[str, Tensor]
     50         return {
---> 51             'classification': self.classification_head.compute_loss(targets, head_outputs, matched_idxs),
     52             'bbox_regression': self.regression_head.compute_loss(targets, head_outputs, anchors, matched_idxs),
     53         }

/usr/local/lib/python3.6/dist-packages/torchvision/models/detection/retinanet.py in compute_loss(self, targets, head_outputs, matched_idxs)
    118                     foreground_idxs_per_image,
    119                     targets_per_image['labels'][matched_idxs_per_image[foreground_idxs_per_image]]
--> 120                 ] = 1.0
    121 
    122                 # find indices for which anchors should be ignored

IndexError: tensors used as indices must be long, byte or bool tensors
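
For context, the message is PyTorch's advanced-indexing check: on the torch version in the traceback, index tensors must be long (int64), byte or bool, so an int32 index tensor triggers exactly this error. A minimal standalone sketch of the mechanism (not icevision code; dtypes mirror the tensors named in the traceback):

import torch

# On the torch version in the traceback above, integer index tensors must be
# int64 (long); indexing with an int32 tensor raises the same IndexError.
labels = torch.tensor([], dtype=torch.int64)    # like targets_per_image['labels']
matched = torch.tensor([], dtype=torch.int32)   # like matched_idxs_per_image
labels[matched]  # IndexError: tensors used as indices must be long, byte or bool tensors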

To Reproduce
Steps to reproduce the behavior:
The error has happened during learn.lr_find(), but on another occasion I got past this point and hit the error during learn.fine_tune(50, 3e-3, freeze_epochs=1). I placed the notebook at https://github.com/robmarkcole/fire-detection-from-images/blob/master/pytorch/icevision/icevision_firenet_retinanet.ipynb

Expected behavior
No error

Screenshots
NA

Desktop (please complete the following information):

  • Mac Catalina 10.15.5

Additional context
None

Strangely, on rerunning learn.lr_find() the error does not occur on the second attempt; however, it did then occur during training. Given the random nature of its occurrence, could it be arising from a bad annotation?

Still seen on 0.4.0.post1

lgvaz commented

It's the first time I've seen this error; I have to admit I have no clue what it might be... I'll run the notebook you shared and investigate further

You used the same code to train faster_rcnn and efficientdet without problems, right?

"used the same code to train faster_rcnn and efficientdet without problems, right": correct :-)

Have rerun my notebook on icevision-0.4.0.post1 nose-1.3.7 resnest-0.0.6b20201204 and the error is now:

ValueError: All bounding boxes should have positive height and width. Found invalid box [183.92996215820312, 231.0, 383.9397277832031, 113.37773895263672] for target at index 11.

UPDATE: this was a separate issue that was introduced and then resolved separately
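
For reference, that check comes from torchvision expecting boxes as (xmin, ymin, xmax, ymax) with positive width and height; the reported box has ymax (113.38) < ymin (231.0). A quick standalone sanity check (the boxes tensor here is just the offending box from the error, not the notebook's code):

import torch

# boxes in (xmin, ymin, xmax, ymax) format; the offending box from the error above
boxes = torch.tensor([[183.93, 231.0, 383.94, 113.38]])

degenerate = (boxes[:, 2:] <= boxes[:, :2]).any(dim=1)  # non-positive width or height
print(degenerate)         # tensor([True])
print(boxes[degenerate])  # the boxes that would trip torchvision's check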

OK, installed from master and now on 0.5.1. I might have some insight now. I initially ran with learning rate 3e-3 and thought this issue must be resolved, as training proceeded without error. I then switched to lr 1e-3 and immediately got the tensors used as indices must be long, byte or bool tensors error again. Looking at the learning rate plot below, it has all these strange jumps in it, with a large jump occurring around 1e-3. Could this be related? Reminder: I only see this error with this dataset using retinanet.

[image: learning rate plot with abrupt jumps]

lgvaz commented

If you're running this in a notebook, when you get the error can you try invoking a debugger with %debug and checking the values of the following variables?

  • targets_per_image
  • matched_idxs_per_image
  • foreground_idxs_per_image

I get:

> /usr/local/lib/python3.6/dist-packages/torchvision/models/detection/retinanet.py(120)compute_loss()
    118                     foreground_idxs_per_image,
    119                     targets_per_image['labels'][matched_idxs_per_image[foreground_idxs_per_image]]
--> 120                 ] = 1.0
    121 
    122                 # find indices for which anchors should be ignored

ipdb> 
ipdb> print(targets_per_image)
{'labels': tensor([], device='cuda:0', dtype=torch.int64), 'boxes': tensor([], device='cuda:0', size=(0, 4))}
ipdb> print(matched_idxs_per_image)
tensor([], dtype=torch.int32)
ipdb> print(foreground_idxs_per_image)
tensor([], dtype=torch.bool)
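
Note that labels in targets_per_image is empty, i.e. this image reached the model with no ground-truth boxes, and matched_idxs_per_image is int32 rather than int64. A rough way to flag such records ahead of time, sketched against torchvision-style target dicts (this helper is hypothetical, not icevision's API):

import torch

# Hypothetical helper over torchvision-style targets: one dict per image with
# 'boxes' (N, 4) and 'labels' (N,) tensors. Not icevision's own API.
def find_empty_targets(targets):
    # return indices of images that have no ground-truth boxes/labels
    return [i for i, t in enumerate(targets)
            if t["boxes"].numel() == 0 or t["labels"].numel() == 0]

# mirrors the image seen in the debugger output above
targets = [{"boxes": torch.zeros((0, 4)),
            "labels": torch.tensor([], dtype=torch.int64)}]
print(find_empty_targets(targets))  # [0]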

The training does not look correct at all

[image: training plot]

lgvaz commented

So the loss is exploding here, which might be related. What happens if you run with a much smaller learning rate, like 5e-5 or 1e-4?
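
Concretely, reusing the fine_tune call from the notebook, that would look something like this (epoch count kept from the original call, lr swapped for one of the suggested values; learn is the Learner from the notebook):

# same call as in the notebook, but with a much smaller max learning rate
learn.fine_tune(50, 5e-5, freeze_epochs=1)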

Alright, on 0.5.1 now, and indeed the loss is under control with a low lr and no error!

[image: training plot with loss under control]

UPDATE: on a rerun, the error occurs again!

[image: error on rerun]

lgvaz commented

And now the loss isn't exploding either =x

Any new insights on what might be happening?

The error persists in 0.7.0. Losses all look fine. No new insights, I'm afraid.

Alright, on 0.7.1a1 I am no longer hitting this issue - any idea if a recent change could have resolved this, or am I just getting lucky somehow?

@robmarkcole it may have to do with the fastai version update
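
If anyone else hits this, a quick way to record the versions in play when reporting back (assuming the usual __version__ attributes, which these packages expose):

import fastai, torch, torchvision
import icevision

print("icevision:", icevision.__version__)
print("fastai:", fastai.__version__)
print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)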

Thanks @rsomani95, will close this then