Eromera/erfnet_pytorch

CARLA Simulator - Semantic Segmentation

Closed this issue · 33 comments

Hello,

Firstly, thanks for this amazing work.

Secondly, I want to train the network on my own dataset from the CARLA Simulator. Are there any tips on how to adapt your implementation to a dataset with only 12 semantic classes?

Hi! Thanks and sorry for my late reply.
I haven't tested it on CARLA, but it should work if you modify the NUM_CLASSES variable in main.py and the "ignoreIndex" that is passed to iouEval. For example, if your ignore class is the label "11", replace the "iouEvalTrain = iouEval(NUM_CLASSES)" line with
iouEvalTrain = iouEval(NUM_CLASSES, 11) and do the same for the iouEvalVal line. Hope that works!

Hi @Eromera
thanks for the tips,
I also need to remove the class weights in the train function that do not apply to my dataset.

I tried on my dataset, and I am receiving this error

File "/erfnet_pytorch-master/train/erfnet.py", line 20, in forward
 output = torch.cat([self.conv(input), self.pool(input)], 1)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 170 and 171 in dimension 3 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:111

And this is the part of the code which is giving the error

class DownsamplerBlock (nn.Module):
    def __init__(self, ninput, noutput):
        super(DownsamplerBlock, self).__init__()

        self.conv = nn.Conv2d(ninput, noutput-ninput, (3, 3), stride=2, padding=1, bias=True)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bn = nn.BatchNorm2d(noutput, eps=1e-3)

    def forward(self, input):
        print([self.conv(input).size(), self.pool(input).size()])
        output = torch.cat([self.conv(input), self.pool(input)], 1)
        output = self.bn(output)
        return F.relu(output)

The output of the print line is:

[(1, 13, 256, 341), (1, 3, 256, 341)]
[(1, 48, 128, 171), (1, 16, 128, 170)]

I see that there is a mismatch in the last dimension (170 vs. 171), but I don't know why.

Do you have any idea what could be causing this?
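
For readers hitting the same mismatch, it can be reproduced with the usual output-size formulas for a strided convolution and for max pooling. The sketch below is plain Python arithmetic, not code from the repository; the helper names conv_out and pool_out are made up for illustration, and the pooling formula is simplified (no padding). For an odd width such as 341, the padded 3x3 convolution rounds up while the unpadded 2x2 pooling rounds down, which would explain 171 vs. 170.

```python
import math


def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride, ceil_mode=False):
    """Output length of unpadded max pooling along one axis."""
    steps = (size - kernel) / stride
    rounded = math.ceil(steps) if ceil_mode else math.floor(steps)
    return rounded + 1


width = 341  # width reported by the first print line in the thread

conv_w = conv_out(width, kernel=3, stride=2, padding=1)
pool_w = pool_out(width, kernel=2, stride=2)
pool_w_ceil = pool_out(width, kernel=2, stride=2, ceil_mode=True)

print(conv_w, pool_w, pool_w_ceil)  # 171 170 171
```

With an even size such as the height of 256, both branches give 128, so only odd sizes trigger the error.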

I solved it by setting ceil_mode=True for the nn.MaxPool2d layer.
Now I have another error:

Traceback (most recent call last):
  File "main.py", line 506, in <module>
    main(parser.parse_args())
  File "main.py", line 460, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 232, in train
    loss = criterion(outputs, targets[:, 0])
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 82, in forward
    return self.loss(torch.nn.functional.log_softmax(outputs, dim=1), targets)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/modules/loss.py", line 193, in forward
    self.ignore_index, self.reduce)
  File "/home/.local/lib/python2.7/site-packages/torch/nn/functional.py", line 1334, in nll_loss
    return torch._C._nn.nll_loss2d(input, target, weight, size_average, ignore_index, reduce)
RuntimeError: input and target batch or spatial sizes don't match: target [5 x 64 x 85], input [5 x 13 x 64 x 86] at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:24

I solved it by manually resizing the target
target = Resize((64, 86), Image.NEAREST)(target)
instead of
target = Resize(int(self.height/8), Image.NEAREST)(target)

And I got this error afterwards

main.py:292: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  inputs = Variable(images, volatile=True)    #volatile flag makes it free backward or outputs for eval
main.py:293: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  targets = Variable(labels, volatile=True)
main.py:297: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  epoch_loss_val.append(loss.data[0])
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f877c13ef10>> ignored
Traceback (most recent call last):
  File "main.py", line 507, in <module>
    main(parser.parse_args())
  File "main.py", line 461, in main
    model = train(args, model, True) #Train encoder
  File "main.py", line 304, in train
    iouEvalVal.addBatch(outputs.max(1)[1].unsqueeze(1).data, targets.data)
  File "/erfnet_pytorch-master/train/iouEval.py", line 61, in addBatch
    tp = torch.sum(torch.sum(torch.sum(tpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 2)

Then I replaced this

tp = torch.sum(torch.sum(torch.sum(tpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()
fpmult = x_onehot * (1-y_onehot-ignores) #times prediction says its that class and gt says its not (subtracting cases when its ignore label!)
fp = torch.sum(torch.sum(torch.sum(fpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()
fnmult = (1-x_onehot) * (y_onehot) #times prediction says its not that class and gt says it is
fn = torch.sum(torch.sum(torch.sum(fnmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()

by

tp = torch.sum(torch.sum(torch.sum(tpmult, dim=0, keepdim=True))).squeeze()
fpmult = x_onehot * (1-y_onehot-ignores) #times prediction says its that class and gt says its not (subtracting cases when its ignore label!)
fp = torch.sum(torch.sum(torch.sum(fpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()
fnmult = (1-x_onehot) * (y_onehot) #times prediction says its not that class and gt says it is
fn = torch.sum(torch.sum(torch.sum(fnmult, dim=0, keepdim=True))).squeeze()

I am not sure whether this is right, but it seemed the obvious change because both tensors are 1-dim tensors and the second torch.sum tries to sum over dim 2.

On my dataset, the validation IoU is always the same; it doesn't change.
Thus, the model from the first epoch is saved as the best and is never replaced afterwards. So far I have trained for 7 epochs and it is still the same.

Hello,

I have a question regarding the labels you have chosen for CARLA. Did you use the red-channel image you get with PostProcessing = "Semantic Segmentation" as your label, or did you convert it into the colored semantic segmentation image (every object has a different color) and provide that as the label?

It'd be really helpful to know!
Thanks in advance!

Regards,

@SoumiDas you mean instance segmentation ?

@SoumiDas I used the semantic segmentation labels obtained from the SemanticSegmentation camera of CARLA.
But I still have this problem of a constant val IoU, and I don't know what the cause is. I am waiting for an answer from @Eromera.

exactly, all the labels are encoded in the red channel

Hi,

Did you encounter the error: RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at /opt/conda/conda-bld/pytorch-cpu_1532571975038/work/aten/src/THNN/generic/SpatialClassNLLCriterion.c:110 in your execution?

Thanks!

One more question: did you set the ignore_index value to 0, since class 0 means None (void) in the CARLA data?

I didn't use any ignore_index

I got varying IoU over the training set.

What did you change in the code? Did you get the errors that I posted above?

If possible, can you share your Python files with me? Maybe I did something wrong, so I would like to try yours.

Yes, I got all of the errors that you posted. I also handled the ignore_index part; in our case it is the 0th class, since that is None. I made no other changes apart from the ones you mentioned and the ignore_index.

did you change this fp = torch.sum(torch.sum(torch.sum(fpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze() to one dimension ?

I did ignore_index=9, but I received this error

fpmult = x_onehot * (1-y_onehot-ignores) #times prediction says its that class and gt says its not (subtracting cases when its ignore label!)
RuntimeError: The size of tensor a (6) must match the size of tensor b (86) at non-singleton dimension 3

did you get this before ?

No I kept it similar to the code. Also, I just tried with 2 epochs to check if it's working. Training set IoU is increasing, but validation set IoU is stable as of now. I'll run for more epochs and check.

@SoumiDas yes this happens to me, what about the above error ?

Yes, you're getting that error because of these lines:
x_onehot = x_onehot[:, :self.ignoreIndex]
y_onehot = y_onehot[:, :self.ignoreIndex]

These lines keep only the class channels before ignore_index, which works when the ignored class is the last one (in their case, 19). For CARLA, the index to be ignored is 0, since it's None. I did

x_onehot = x_onehot[:, self.ignoreIndex+1:]
y_onehot = y_onehot[:, self.ignoreIndex+1:]

and got rid of the error I was getting:

RuntimeError: The size of tensor a (0) must match the size of tensor b (86) at non-singleton dimension 3
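
Both slices above depend on where the ignored class sits: [:, :ignoreIndex] only works when it is the last class, and [:, ignoreIndex+1:] only when it is the first. A position-independent variant removes just that one channel. The sketch below illustrates the idea on a plain Python list standing in for the class axis; drop_class_channel and the class names are hypothetical, and on tensors the same idea could be written as torch.cat([x[:, :k], x[:, k + 1:]], dim=1).

```python
def drop_class_channel(per_class, ignore_index):
    """Return the per-class entries with only the ignored class removed."""
    return per_class[:ignore_index] + per_class[ignore_index + 1:]


# Hypothetical class axis with four entries; index 0 plays the role of "None".
channels = ["class0", "class1", "class2", "class3"]

original_slice = channels[:0]          # [:ignoreIndex] with ignoreIndex = 0
first_dropped = drop_class_channel(channels, 0)
middle_dropped = drop_class_channel(channels, 2)

print(original_slice)   # []
print(first_dropped)    # ['class1', 'class2', 'class3']
print(middle_dropped)   # ['class0', 'class1', 'class3']
```

The empty result of the first slice matches the "size of tensor a (0)" in the error message above.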

So if I do what you did, I will not have the error anymore?

I did what you suggested, and it no longer gives errors.
The first IoU on the val set is 91.67 after the first epoch, which I am sure is wrong; it is too high for a first epoch. I will wait and see how the following epochs go.

I think you can just restore the true positive, false positive and false negative lines to the original ones from iouEval.py. You'll probably get reasonable results after that.

Also, as the warning says, 'volatile=True' is no longer valid and has no effect. So you can try 'with torch.no_grad():' instead.

@SoumiDas
"you can just replace those lines of true positive and false positive and false negative with the original lines of the iouEval.py code"
Sorry, I don't understand. Can you please tell me exactly which lines to change, and with what?

"Also, as the warning goes, 'volatile=True' is not valid anymore and has no effect. So you can try with 'with torch.no_grad():' maybe.."
I don't get this either. Where is volatile=True or torch.no_grad()?

For the volatile=True part, do you mean I should change

inputs = Variable(images, volatile=True)    #volatile flag makes it free backward or outputs for eval
targets = Variable(labels, volatile=True)

to

with torch.no_grad():    # replaces volatile=True: no gradients are tracked inside this block
    inputs = Variable(images)
    targets = Variable(labels)

Yes right.

Also, the outputs = model(....) line after the targets = Variable(labels) line needs to be inside the torch.no_grad() block, since that is what stops it from calculating gradients for backpropagation.

And what I mean by "you can just replace those lines of true positive and false positive and false negative with the original lines of the iouEval.py code" is: keep the line below as it was,

tp = torch.sum(torch.sum(torch.sum(tpmult, dim=0, keepdim=True), dim=2, keepdim=True), dim=3, keepdim=True).squeeze()

and likewise the following lines that calculate fp and fn.

Hope you get rid of that 90+% IoU in the first epoch after following these steps.

@SoumiDas
thanks a lot for your support

So this means that for tp, fp and fn I should keep the original calculation over three dimensions, right? And not change it as I did above.
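
To illustrate why the three-axis reduction matters: the original lines sum over batch, height and width (dims 0, 2 and 3) and keep the class axis, so tp, fp and fn hold one count per class. Summing over everything collapses them to single numbers and the per-class IoU is lost. Below is a toy sketch with nested Python lists shaped (batch, classes, height, width); it is not the repository's code, the helper names are hypothetical, and the zero-denominator guard is a choice made for this example.

```python
def sum_per_class(tensor):
    """Sum a (batch, classes, height, width) nested list over all axes except classes."""
    totals = [0] * len(tensor[0])
    for sample in tensor:
        for class_id, plane in enumerate(sample):
            totals[class_id] += sum(sum(row) for row in plane)
    return totals


def iou_per_class(tp, fp, fn):
    """IoU = tp / (tp + fp + fn) for each class; 0.0 when a class never occurs."""
    result = []
    for t, p, n in zip(tp, fp, fn):
        denominator = t + p + n
        result.append(t / denominator if denominator else 0.0)
    return result


# One 2x2 image, two classes. Prediction [[0, 0], [1, 1]], ground truth [[0, 1], [1, 1]].
tpmult = [[[[1, 0], [0, 0]], [[0, 0], [1, 1]]]]
fpmult = [[[[0, 1], [0, 0]], [[0, 0], [0, 0]]]]
fnmult = [[[[0, 0], [0, 0]], [[0, 1], [0, 0]]]]

tp = sum_per_class(tpmult)
fp = sum_per_class(fpmult)
fn = sum_per_class(fnmult)
ious = iou_per_class(tp, fp, fn)
mean_iou = sum(ious) / len(ious)

# Collapsing every axis instead gives one pooled ratio, not a per-class mean.
pooled = sum(tp) / (sum(tp) + sum(fp) + sum(fn))

print(tp, fp, fn)  # [1, 2] [1, 0] [0, 1]
print(ious, mean_iou, pooled)
```

In this toy case the per-class mean is about 0.583 while the pooled value is 0.6, so the two reductions are not interchangeable.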

When I train the decoder, it always raises a RuntimeError: CUDNN_STATUS_INTERNAL_ERROR. Can you give me some advice on how to handle it?

I think the RuntimeError: CUDNN_STATUS_INTERNAL_ERROR is normally related to a mismatch between the number of classes in the labels and in the predictions; otherwise it may be a faulty cuDNN installation. Try installing it manually by downloading it and copying it into the CUDA folder.
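
One way to rule out the label mismatch before training is to scan the label images for ids the loss cannot handle, i.e. values outside [0, NUM_CLASSES) that are not the ignore index. This is also the condition behind the `cur_target >= 0 && cur_target < n_classes` assertion quoted earlier in the thread. The sketch below works on a label image given as nested Python lists; the function name and the sample values are made up for illustration.

```python
def invalid_label_values(label_rows, num_classes, ignore_index=None):
    """Return the sorted label ids that fall outside [0, num_classes) and are not ignored."""
    seen = {value for row in label_rows for value in row}
    return sorted(
        value
        for value in seen
        if value != ignore_index and not 0 <= value < num_classes
    )


# Hypothetical 2x3 label image; 13 and 255 are out of range for 13 classes.
label_image = [[0, 1, 7], [12, 13, 255]]

without_ignore = invalid_label_values(label_image, num_classes=13)
with_ignore = invalid_label_values(label_image, num_classes=13, ignore_index=255)

print(without_ignore)  # [13, 255]
print(with_ignore)     # [13]
```

An empty list for every label image would suggest that the class count is consistent and the problem lies elsewhere, for example in the cuDNN installation.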

I'm closing this issue since the conversation has been inactive for more than two months. If you keep having issues, please reopen!