Array size mismatch when calculating cross_entropy2d
This happens when executing nll_loss in code/torchfcn/utils.py under the sourceonly mode. The training data is from GTA5.
The error occurs because the snippet inside cross_entropy2d() first uses a mask to exclude elements of target (that is, the labels) whose values are less than 0. In other words, mislabeled pixels are not involved when calculating the cross entropy. However, the corresponding prediction values for those removed pixels still exist in log_p, which leads to the array size conflict.
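A minimal sketch of this failure mode (the shapes and tensors here are hypothetical, for illustration only; the real masking lives in cross_entropy2d() in utils.py):

import torch
import torch.nn.functional as F

n, c, h, w = 1, 19, 4, 4                                # hypothetical sizes
log_p = F.log_softmax(torch.randn(n, c, h, w), dim=1)   # per-pixel log-probabilities
target = torch.randint(-1, c, (n, h, w))                # -1 marks mislabeled pixels

# Masking only the target shrinks it, while log_p keeps every pixel:
masked_target = target[target >= 0]                     # shape: (#valid,)
flat_log_p = log_p.permute(0, 2, 3, 1).reshape(-1, c)   # shape: (n*h*w, c)
# F.nll_loss(flat_log_p, masked_target) -> size mismatch once any pixel is dropped

# Applying the same mask to log_p keeps the two tensors in sync:
mask = (target >= 0).view(-1)
loss = F.nll_loss(flat_log_p[mask], masked_target)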
To use the GTA5 data, its label set has to be mapped to the Cityscapes label set. Since GTA5 has many more classes and our target is Cityscapes, we care only about the common classes. The data organization documentation provided here should help you:
https://github.com/VisionLearningGroup/taskcv-2017-public/tree/master/segmentation
We will include details about this in the README soon.
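For reference, the remapping usually looks something like the sketch below. The exact GTA5-to-Cityscapes id table comes from the documentation linked above; the id_to_trainid dict here is abbreviated and only illustrative:

import numpy as np

# Abbreviated, illustrative mapping from raw label ids to the 19 shared
# Cityscapes train ids; see the taskcv-2017 docs above for the full table.
id_to_trainid = {7: 0, 8: 1, 11: 2, 26: 13}  # road, sidewalk, building, car, ...

def remap_gta5_label(lbl):
    # Everything that is not a shared class becomes -1 (ignored by the loss).
    out = np.full(lbl.shape, -1, dtype=np.int32)
    for raw_id, train_id in id_to_trainid.items():
        out[lbl == raw_id] = train_id
    return out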
Actually, I found that elements in log_p are not removed accordingly when their corresponding labels in target are out of range.
This should not happen if the labels are preprocessed correctly. Refer to this condition here:
LSD-seg/code/torchfcn/utils.py
Line 55 in e373e89
This condition should ensure that values of log_p exist only for the "in range" values of target.
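Concretely, that condition applies the same >= 0 mask to log_p (the line is quoted verbatim later in this thread), so predictions are dropped for exactly the pixels that are dropped from target:

log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]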
Well, things did not go as expected.
When I train on the GTA5 data, the program occasionally falls into the except block below:
LSD-seg/code/torchfcn/utils.py
Lines 54 to 57 in e373e89
BTW, I organized the file structure of the dataset according to your code's specification.
In GTA5, there are some image-label pairs that are not of the same size, so this exception might be triggered there. Could you please use the clean filelists that we have uploaded in this repo? You can find them in the data/filelist directory. For training on GTA5, these are GTA5_<image/label>list_train.txt.
Thanks, but I am indeed using your data/filelist directory. However, the results still turned out to be what I described above.
Could you please specify your pytorch/torchvision version?
At first I used pytorch: 0.3.1-py36_cuda8.0.61_cudnn7.1.2_3 and torchvision: 0.2.0-py36h17b6947_1, but I got this error when doing backward() in sourceonly mode with GTA5:
RuntimeError: invalid argument 1: the number of sizes provided must be greater or equal to the number of dimensions in the tensor at /opt/conda/conda-bld/pytorch_1523244252089/work/torch/lib/THC/generic/THCTensor.c:326
So I used PyTorch compiled from source (the latest version) according to this post.
I'm not sure whether it is the version of PyTorch that leads to this problem. Thanks.
>>> import torch
>>> torch.__version__
'0.2.0_3'
>>> import torchvision
>>> torchvision.__version__
'0.2.0'
Does this error occur with all files? Can you verify why these errors occur? If you can give more info from your end, we can help debug this.
I guess it is all about the version of PyTorch.
When I used 0.3.1, cross_entropy2d worked fine but backward() hit the RuntimeError mentioned above.
When I used the latest PyTorch compiled from source, backward() worked but cross_entropy2d failed in the try-except block below:
LSD-seg/code/torchfcn/utils.py
Lines 54 to 57 in e373e89
BTW, when trying to train on the SYNTHIA dataset, I don't know which directory should be marked as synthia_mapped_to_cityscapes, as specified in LSD-seg/data/filelist/SYNTHIA_labellist_train.txt, since the SYNTHIA-RAND-CITYSCAPES dataset contains only three subdirectories, namely Depth, GT and RGB (which you chose as the training images). After checking the images themselves, I picked GT/COLOR as synthia_mapped_to_cityscapes (the labels). When I then ran the code in sourceonly mode, I got a RuntimeError due to an array size mismatch in cross_entropy2d(): specifically, target (that is, the label image) has shape (minibatch x h x w x 4), not (minibatch x h x w).
LSD-seg/code/torchfcn/utils.py
Lines 35 to 44 in e373e89
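That (h x w x 4) shape suggests the RGBA color-coded annotations were loaded rather than per-pixel label ids; cross_entropy2d() expects a single-channel label map. A quick check (the file path here is hypothetical):

from PIL import Image
import numpy as np

lbl = np.array(Image.open('GT/COLOR/0000001.png'))  # hypothetical file
print(lbl.shape)  # (h, w, 4): RGBA color coding, not class ids
# The loss expects lbl.ndim == 2 (one integer class id per pixel), so the
# color-coded images must be decoded to ids, or an id-encoded ground truth
# directory used instead.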
I got a problem with the cross_entropy_2d function as well. Training runs well until it breaks at different points: sometimes it stops after 350 iterations, sometimes after 2000. I turned off shuffling of the filelists, so there should be no issue with the input data. The images and labels are fine.
The error which occurs:
Train epoch = 0: 11%|##3 | 331/2975 [07:48<1:02:12, 1.41s/it]�[ATraceback (most recent call last):
File "train.py", line 161, in <module>
main()
File "train.py", line 157, in main
trainer.train()
File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
self.train_epoch()
File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 254, in train_epoch
lossD_src_real_c = cross_entropy2d(outD_src_real_c, label_forD, size_average=self.size_average)
File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 65, in cross_entropy2d
loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
File "<ROOT_ENV>/lib/python2.7/site-packages/torch/nn/functional.py", line 676, in nll_loss
raise ValueError('Expected 2 or 4 dimensions (got {})'.format(dim))
ValueError: Expected 2 or 4 dimensions (got 0)
Exception KeyError: KeyError(<weakref at 0x7fae42094f70; to 'tqdm' at 0x7fae2c0d7dd0>,) in <bound method tqdm.__del__ of Train: 0%| | 0/33 [07:49<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fae2edc60a8; to 'tqdm' at 0x7fae2c0b4950>,) in <bound method tqdm.__del__ of Train epoch = 0: 11%|##3 | 331/2975 [07:48<1:02:12, 1.41s/it]> ignored
Maybe there is an issue due to the versions of the installed packages. Until I used the exact PyTorch version you used (0.2.0_3) there were many more issues, so I guess it is important to use exactly that build @swamiviv. So maybe you could post the versions you used for the fcn and opencv packages as well?
Due to the non-deterministic behaviour of the training I don't know what to do to get this running. :(
The only change I made was in segmentation_datasets.py to modify the Cityscapes labels, which I use as the source domain. When I used the stock code I got the same exception mentioned above in the cross_entropy_2d function:
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generated/../THCReduceAll.cuh line=334 error=59 : device-side assert triggered
Exception KeyError: KeyError(<weakref at 0x7fac43de7f70; to 'tqdm' at 0x7fac02b08950>,) in <bound method tqdm.__del__ of Train epoch = 0: 0%| | 0/2975 [00:01<?, ?it/s]> ignored
Exception KeyError: KeyError(<weakref at 0x7fac320bd890; to 'tqdm' at 0x7fac30d6bdd0>,) in <bound method tqdm.__del__ of Train: 0%| | 0/33 [00:02<?, ?it/s]> ignored
Exception: (1L, 40L, 80L)
Traceback (most recent call last):
File "train.py", line 161, in <module>
main()
File "train.py", line 157, in main
trainer.train()
File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 361, in train
self.train_epoch()
File "<ROOT>/LSD-seg/code/torchfcn/trainer_LSD.py", line 281, in train_epoch
lossF_src_adv_s = cross_entropy2d(outD_src_fake_s, domain_labels_tgt_real,size_average=self.size_average)
File "<ROOT>/LSD-seg/code/torchfcn/utils.py", line 61, in cross_entropy2d
mask = target >= 0
File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 888, in __ge__
return self.ge(other)
File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/variable.py", line 802, in ge
return Ge.apply(self, other)
File "<ROOT_ENV>/lib/python2.7/site-packages/torch/autograd/_functions/compare.py", line 17, in forward
mask = getattr(a, cls.fn_name)(b)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generated/../THCTensorMathCompare.cuh:84
Because of this I flagged all labels which are out of range (255) with -1, because with 255 we would have more than n_classes.
To do that I changed the code in segmentation_datasets.py - SegmentationData_BaseClass - __getitem__(self, index) to:
def __getitem__(self, index):
    data_file = self.files[self.split][index]
    # Loading image and label
    img, lbl = self.image_label_loader(data_file['img'], data_file['lbl'], self.image_size, random_crop=True)
    img = img[:, :, ::-1]         # RGB -> BGR
    img -= self.mean_bgr          # subtract per-channel mean
    img = img.transpose(2, 0, 1)  # HWC -> CHW
    if self.dset != 'cityscapes':
        lbl[lbl > 18] = -1        # flag out-of-range labels as don't-care
    else:
        lbl[lbl == -1] = 19       # remap don't-care so it survives the uint8 round trip
    lbl = Image.fromarray(lbl.squeeze().astype(np.uint8))
    lbl = np.array(lbl, dtype=np.int32)
    lbl[lbl > 18] = -1            # restore don't-care labels (255/19 -> -1)
    img = torch.from_numpy(img.copy()).float()
    lbl = torch.from_numpy(lbl.copy()).long()
    return img, lbl
Edit/Update
Fixed it. Due to the image cropping there was a chance of getting images containing only don't-care labels (-1), so after the line log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0] there won't be any entries left to calculate the loss, and the exception was thrown. I wrote a workaround to catch this and now the training is running fine. :)
@Toxiiin may I ask what your workaround involved? I am facing the same issue with
log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]
@mattmcc97 The easiest workaround would be to ensure that the cropped images do not consist only of pixels flagged as don't-care (-1). One possibility would be to loop over the cropping operation until you get an image with a sufficient number of valid labels (!= -1).
Furthermore, you could set the calculated loss manually to zero if there are only don't-care labels in the current image (so there won't be a gradient update in this step), but that would be the quick-and-dirty solution, I think. :) A sketch of that guard follows below.
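A minimal sketch of that guard, assuming roughly the structure of cross_entropy2d() from utils.py (the early return for empty targets is the added workaround; the masking and loss lines follow the code quoted in this thread):

import torch.nn.functional as F

def cross_entropy2d(input, target, weight=None, size_average=True):
    # input: (n, c, h, w) scores; target: (n, h, w) labels, -1 = don't care
    n, c, h, w = input.size()
    log_p = F.log_softmax(input)
    log_p = log_p.transpose(1, 2).transpose(2, 3).contiguous()
    log_p = log_p[target.view(n, h, w, 1).repeat(1, 1, 1, c) >= 0]
    log_p = log_p.view(-1, c)
    mask = target >= 0
    target = target[mask]
    if target.numel() == 0:
        # Every pixel in this crop is don't-care: return a zero loss that is
        # still connected to the graph, so backward() runs but updates nothing.
        return input.sum() * 0.0
    loss = F.nll_loss(log_p, target, weight=weight, size_average=False)
    if size_average:
        loss = loss / mask.data.sum()
    return loss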
Thanks, I originally thought I might have a mask with everything labelled as not interesting. It turns out there was some corruption in my masks, and a few pixels were labelled with random unexpected values.