sacmehta/ESPNetv2

Performance Issues on NVIDIA GTX1080

nithishc829 opened this issue · 10 comments

I trained the same ESPNetV2 on my GTX 1080 CUDA GPU for 10-class semantic segmentation, after some modifications to the code so it works for my 10 classes. The input image size was 640x480 and I got an mIoU of 62% on validation, which is really good. However, I was expecting better runtime performance: I measured 52 fps inference speed, averaged over all samples[5:]. I wanted to know why there is such a huge difference between the performance claimed in the paper (140 fps) and this implementation (~50 fps). This is how I ran the code:
python main.py --batch_size 10 --s 1.0 --inWidth 640 --inHeight 480 --max_epochs 350 --batch_size 32 --classes 10 --csvfile ~/data/Cityscape_v2/class_dict_grouped.csv --data_dir pwd
The last parameter is one I added so I could train on my dataset.

You should not include the image reading and writing time, because those operations are slow. As a standard convention, no model reports inference time including image reading and writing operations.
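To illustrate, the convention is to start the clock only around the forward pass, something like the sketch below (a minimal example, not the repository's benchmarking code; model and val_loader stand in for your CUDA segmentation model and a standard PyTorch DataLoader):

import time
import torch

with torch.no_grad():
    for image, _ in val_loader:      # image reading/decoding happens here, outside the timer
        image = image.cuda()         # host-to-GPU copy, also excluded from the timing
        start = time.time()
        output = model(image)        # only the network forward pass is timed
        elapsed = time.time() - start
        # writing or colorizing the predicted mask would likewise be excluded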

No sir, I am not including them.

Relevant portion of the code:
start_time = time.time()
# run the model
output1 = model(input)
time_taken = time.time() - start_time
time_list.append(time_taken)

Function used to measure inference time:

def val(args, val_loader, model, criterion):
    '''
    :param args: general arguments
    :param val_loader: loader for the validation dataset
    :param model: model
    :param criterion: loss function
    :return: average epoch loss, overall pixel-wise accuracy, per-class accuracy, per-class IoU, and mIoU
    '''
    # switch to evaluation mode
    model.eval()

    iouEvalVal = iouEval(args.classes)

    epoch_loss = []
    time_list = []
    total_batches = len(val_loader)
    blist = helpers.get_label_info_new(args.csvfile)
    for i, (input, target) in enumerate(val_loader):

        if args.onGPU:
            print('Non_blocking')
            input = input.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
        else:
            print('Blocking')
            input = input.cuda()
            target = target.cuda()

        start_time = time.time()
        # run the model
        output1 = model(input)
        time_taken = time.time() - start_time
        time_list.append(time_taken)
        # compute the loss
        loss = criterion(output1, target)
        epoch_loss.append(loss.item())
        # compute the confusion matrix
        iouEvalVal.addBatch(output1.max(1)[1].data, target.data)
        print('[%d/%d] loss: %.3f time: %.4f' % (i, total_batches, loss.item(), time_taken))

    average_epoch_loss_val = sum(epoch_loss) / len(epoch_loss)
    overall_acc, per_class_acc, per_class_iu, mIOU = iouEvalVal.getMetric()
    print('Average fps ', 1 / np.mean(np.array(time_list)))
    return average_epoch_loss_val, overall_acc, per_class_acc, per_class_iu, mIOU

Do not include the first iteration because PyTorch has some initialization time
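For example, assuming time_list is the per-batch timing list collected in val() above, you could simply drop the first few entries before averaging (a minimal sketch; the choice of 5 warm-up iterations is illustrative, not a value the paper prescribes):

warmup = 5                                   # illustrative number of warm-up iterations to discard
timed = time_list[warmup:]                   # skip iterations that include CUDA/cuDNN initialization
print('Average fps:', 1.0 / (sum(timed) / len(timed)))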

Yes, you are right, sir. I saw that too. I removed the first 5 samples when profiling, and the average is still 53 fps.

What version of cuDNN and CUDA are you using?

CUDA
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

cuDNN version 7
libcudnn.so.7

Use the code below to check your cuDNN version. Not all versions have cuDNN-optimized depth-wise convolutions.

import torch
print(torch.backends.cudnn.version())

import torch
print(torch.backends.cudnn.version())
7402

I assume you are using PyTorch 0.4+. Could you try making the following changes to your code and see what happens:

device = torch.device('cuda')
with torch.no_grad():
    for i, (input, _) in enumerate(val_loader):
        input = input.to(device=device)
        start_time = time.time()
        # run the model
        output1 = model(input)
        time_taken = time.time() - start_time
        time_list.append(time_taken)
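Putting these suggestions together, the whole measurement loop might look like the sketch below (an illustrative consolidation, not code from this repository; model and val_loader are assumed to exist, and the torch.cuda.synchronize() calls are an extra assumption on my part to keep asynchronous kernel launches from skewing the wall-clock timings):

import time
import torch

device = torch.device('cuda')
model = model.to(device).eval()              # inference mode, weights on the GPU

time_list = []
with torch.no_grad():
    for i, (input, _) in enumerate(val_loader):
        input = input.to(device=device)
        torch.cuda.synchronize()             # assumption: wait for the transfer before starting the clock
        start_time = time.time()
        output1 = model(input)               # run the model
        torch.cuda.synchronize()             # assumption: wait for the forward pass to finish
        time_list.append(time.time() - start_time)

# discard the first few iterations, which include one-time initialization costs
print('Average fps:', 1.0 / (sum(time_list[5:]) / len(time_list[5:])))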

I also noticed that your GPU is different from the one we used. The TitanX is considerably faster than the GTX 1080, so that explains the difference in speed.

I made the changes you mentioned, but the fps is the same. I understand that the TitanX is faster; I thought some implementation detail was causing this performance issue. But the difference in fps is too large, i.e. I got almost 3x less speed, which I was not expecting...

Anyhow, thanks for the quick response and support. You can close this issue, sir.
Really good paper, sir.

How can I contact you for more information?