swamiviv/LSD-seg

Array size mismatch when calculating cross_entropy2d

Opened this issue · 14 comments

This happens when executing nll_loss in code/torchfcn/utils.py in sourceonly mode. The training data is from GTA5.

The error occurs because the snippet inside cross_entropy2d() first uses a mask to exclude elements of target (that is, the labels) whose values are less than 0. In other words, mislabeled pixels are excluded from the cross-entropy calculation.

However, the corresponding prediction values for those removed pixels still exist in log_p, which leads to the array size conflict.
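
A toy reproduction of what I mean, with made-up sizes (modern PyTorch syntax, not the repo's actual code):

import torch
import torch.nn.functional as F

log_p = torch.randn(4, 3)             # predictions for 4 pixels, 3 classes
target = torch.tensor([0, -1, 2, 1])  # one mislabeled pixel (-1)
target = target[target >= 0]          # the mask drops it, leaving 3 entries
# log_p still has 4 rows, so nll_loss sees mismatched sizes and raises
loss = F.nll_loss(log_p, target)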

To use GTA5 data, the label set has to be mapped to the Cityscapes label set. Since GTA5 has many more classes and our target is Cityscapes, we care only about the common classes. The data organization documentation provided here should help you:

https://github.com/VisionLearningGroup/taskcv-2017-public/tree/master/segmentation

We will include details about this in the README soon.

Actually, I found that elements in log_p are not removed accordingly when their corresponding labels in target are out of range.

This should not happen if the labels are preprocessed correctly. Refer to this condition here:

log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]

This condition should ensure that the values of log_p exist only for the "in range" values of the target.
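
As a quick sanity check with toy tensors (made-up shapes, modern PyTorch syntax): the number of entries kept in log_p is exactly c times the number of valid target pixels.

import torch

n, c, h, w = 1, 3, 2, 2
log_p = torch.randn(n, h, w, c)             # predictions permuted to (n, h, w, c)
target = torch.tensor([[[0, -1], [2, 1]]])  # one don't-care pixel (-1)

kept = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]
assert kept.numel() == c * (target >= 0).sum().item()  # 3 valid pixels * 3 classes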

Well, things still don't go as expected.
When I train on the GTA5 data, the program occasionally falls into the except block below:

try:
    log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]
except:
    print "Exception: ", target.size()

BTW, I organized the dataset's file structure according to your code's specification.

In GTA5, there are some image-label pairs that are not the same size, so this exception might be triggered there. Can you please use the clean filelists that we have uploaded in this repo? You can find them in the data/filelist directory. For training on GTA5, these should be GTA5_<image/label>list_train.txt.
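
If you want to check your copy of the data yourself, a scan along these lines should surface the mismatched pairs (a sketch; it assumes the filelists contain one path per line and that the paths resolve from where you run it):

from PIL import Image

images = open('data/filelist/GTA5_imagelist_train.txt').read().splitlines()
labels = open('data/filelist/GTA5_labellist_train.txt').read().splitlines()
for img_path, lbl_path in zip(images, labels):
    # Image.open only reads the header, so this is cheap even for large files
    if Image.open(img_path).size != Image.open(lbl_path).size:
        print('size mismatch: %s vs %s' % (img_path, lbl_path))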

Thanks, but I am indeed using your data/filelist directory. However, the results still turn out as I described above.
Could you please specify your pytorch/torchvision versions?
At first I used pytorch 0.3.1-py36_cuda8.0.61_cudnn7.1.2_3 and torchvision 0.2.0-py36h17b6947_1, but I got this error when calling backward() in sourceonly mode with GTA5:

RuntimeError: invalid argument 1: the number of sizes provided must be greater or equal to the number of dimensions in the tensor at /opt/conda/conda-bld/pytorch_1523244252089/work/torch/lib/THC/generic/THCTensor.c:326

So I used pytorch compiled from source (the latest version) according to this post.
I'm not sure whether it is the pytorch version that leads to this problem. Thanks.

>>> import torch
>>> torch.__version__
'0.2.0_3'
>>> import torchvision
>>> torchvision.__version__
'0.2.0'

Does this error occur with all files? Can you verify why these errors occur? If you can give more info from your end, we can help debug this.

I guess it is all about the pytorch version.
When I used 0.3.1, cross_entropy2d worked fine but backward() hit the RuntimeError quoted above.
When I used the latest pytorch compiled from source, backward() worked but cross_entropy2d failed in the try-except block below:

try:
    log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]
except:
    print "Exception: ", target.size()

BTW, when trying to train on the SYNTHIA dataset, I don't know which directory should be used as the synthia_mapped_to_cityscapes directory specified in LSD-seg/data/filelist/SYNTHIA_labellist_train.txt, since the SYNTHIA-RAND-CITYSCAPES dataset contains only three subdirectories, namely Depth, GT and RGB (which you chose as the training images). After checking the images themselves, I picked GT/COLOR as synthia_mapped_to_cityscapes (labels). When I then ran the code in sourceonly mode, I got a RuntimeError due to an array size mismatch in cross_entropy2d(): specifically, target (that is, the label image) has shape (minibatch x h x w x 4), not (minibatch x h x w).

def cross_entropy2d(input, target, weight=None, size_average=True):
    """
    Function to compute pixelwise cross-entropy for 2D image. This is the segmentation loss.
    Args:
        input: input tensor of shape (minibatch x num_channels x h x w)
        target: 2D label map of shape (minibatch x h x w)
        weight (optional): tensor of size 'C' specifying the weights to be given to each class
        size_average (optional): boolean value indicating whether the NLL loss has to be
            normalized by the number of pixels in the image
    """

@swamiviv Did you upload the mapped data?

Got a problem with the cross_entropy_2d function as well. Training runs fine until it breaks at different points: sometimes it stops after 350 iterations, sometimes after 2000. I turned off shuffling of the filelists, so there should be no issue with the input data. Images and labels are fine.

The error which occurs:

Train epoch = 0:  11%|##3                  | 331/2975 [07:48<1:02:12,  1.41s/it]
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    main()
  File "train.py", line 157, in main
    trainer.train()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
    self.train_epoch()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 254, in train_epoch
    lossD_src_real_c = cross_entropy2d(outD_src_real_c, label_forD, size_average=self.size_average)
  File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 65, in cross_entropy2d
    loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/nn/functional.py", line 676, in nll_loss
    raise ValueError('Expected 2 or 4 dimensions (got {})'.format(dim))
ValueError: Expected 2 or 4 dimensions (got 0)
Exception KeyError: KeyError(<weakref at 0x7fae42094f70; to 'tqdm' at 0x7fae2c0d7dd0>,) in <bound method tqdm.__del__ of Train:   0%|                                             | 0/33 [07:49<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fae2edc60a8; to 'tqdm' at 0x7fae2c0b4950>,) in <bound method tqdm.__del__ of Train epoch = 0:  11%|##3                  | 331/2975 [07:48<1:02:12,  1.41s/it]> ignored

Maybe there is an issue due to the versions of the installed packages. Until I used the exact pytorch version you used (0.2.0_3) there were many more issues, so I guess it is important to use exactly the build you used @swamiviv. So maybe you could post the versions you used for the fcn and opencv packages as well?

Due to the non-deterministic behaviour of the training, I don't know what to do to get this running. :(

The only change I made was in segmentation_datasets.py, to modify the cityscapes labels, which I use as the source domain. When I used the stock code I got the same exception mentioned above in the cross_entropy_2d function:

/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh line=334 error=59 : device-side assert triggered
Exception KeyError: KeyError(<weakref at 0x7fac43de7f70; to 'tqdm' at 0x7fac02b08950>,) in <bound method tqdm.__del__ of Train epoch = 0:   0%|                                 | 0/2975 [00:01<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fac320bd890; to 'tqdm' at 0x7fac30d6bdd0>,) in <bound method tqdm.__del__ of Train:   0%|                                             | 0/33 [00:02<?, ?it/s]> ignored
Exception:  (1L, 40L, 80L)
Traceback (most recent call last):
  File "train.py", line 161, in <module>
    main()
  File "train.py", line 157, in main
    trainer.train()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
    self.train_epoch()
  File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 281, in train_epoch
    lossF_src_adv_s = cross_entropy2d(outD_src_fake_s, domain_labels_tgt_real,size_average=self.size_average)
  File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 61, in cross_entropy2d
    mask = target >= 0
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 888, in __ge__
    return self.ge(other)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 802, in ge
    return Ge.apply(self, other)
  File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/_functions/compare.py", line 17, in forward
    mask = getattr(a, cls.fn_name)(b)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../THCTensorMathCompare.cuh:84

Because of this, I flagged all out-of-range labels (255) with -1, because with 255 we would have more labels than n_classes.

Because of that, I changed the code in segmentation_datasets.py - SegmentationData_BaseClass - __getitem__(self, index) to:

def __getitem__(self, index):
    data_file = self.files[self.split][index]

    # Loading image and label
    img, lbl = self.image_label_loader(data_file['img'], data_file['lbl'], self.image_size, random_crop=True)
    img = img[:, :, ::-1]          # RGB -> BGR
    img -= self.mean_bgr           # subtract the dataset mean
    img = img.transpose(2, 0, 1)   # HWC -> CHW

    if self.dset != 'cityscapes':
        # flag all labels above the 19 valid classes (0-18) as don't-care
        lbl[lbl > 18] = -1
    else:
        lbl[lbl == -1] = 19        # keep don't-care alive through the uint8 cast
        lbl = Image.fromarray(lbl.squeeze().astype(np.uint8))
        lbl = np.array(lbl, dtype=np.int32)
        lbl[lbl > 18] = -1         # map it (and anything else out of range) back to -1

    img = torch.from_numpy(img.copy()).float()
    lbl = torch.from_numpy(lbl.copy()).long()

    return img, lbl

Edit/Update
Fixed it. Due to image cropping there was a chance of getting images with only don't-care labels (-1), so after the line log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0] there won't be any entries left to calculate the loss, and the exception is thrown. I wrote a workaround to catch this and now the training is running fine. :)

@Toxiiin may I ask what your workaround involved? I am facing the same issue with

log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]

@mattmcc97 The easiest workaround would be to ensure that the cropped images do not consist solely of pixels flagged as don't-care (-1). One possibility would be to loop over the cropping operation until you get an image with a sufficient number of valid labels (!= -1), as sketched below.
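
A minimal sketch of that loop inside __getitem__, reusing the loader call from the snippet above (min_valid is an arbitrary threshold I made up):

min_valid = 100  # arbitrary: require at least this many valid-class pixels
while True:
    img, lbl = self.image_label_loader(data_file['img'], data_file['lbl'],
                                       self.image_size, random_crop=True)
    # count pixels that fall into one of the 19 valid classes (0-18)
    if ((lbl >= 0) & (lbl <= 18)).sum() >= min_valid:
        break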

Furthermore, you could manually set the calculated loss to zero if there are only don't-care labels in the current image (so there won't be a gradient update in this step), but this would be the quick and dirty solution, I think. :)
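
For that variant, a sketch of the guard at the top of cross_entropy2d (variable names as in the snippets above; written against the old Variable API used elsewhere in this thread):

mask = target >= 0
if mask.data.sum() == 0:
    # no valid labels in this crop: return a zero that is still attached
    # to the graph, so backward() becomes a no-op instead of a crash
    return (input * 0).sum()
# ...then continue with the usual masking of log_p and target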

Thanks, I originally thought I might have a mask where everything was labelled as don't-care. It turns out there was some corruption in my masks, and a few pixels were labeled with random unexpected values.
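
In case it helps anyone else, a scan like this would have caught it (a sketch; the glob pattern is hypothetical, and it assumes the 19 valid classes 0-18 with 255 as don't-care discussed above):

import glob
import numpy as np
from PIL import Image

for path in glob.glob('masks/*.png'):  # hypothetical pattern
    values = np.unique(np.array(Image.open(path)))
    bad = values[(values > 18) & (values != 255)]
    if bad.size:
        print('%s has unexpected label values: %s' % (path, bad))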