An error occurs when running on the source data
nuistji opened this issue · 1 comment
Excellent work! I ran into an error while training the pre-trained model; the output looks like this:
Backbone # param.: 23561205
Learnable # param.: 2580968
Total # param.: 26142173
available GPUs: 1
Total (trn) images are : 13680
Total (val) images are : 0
[Epoch: 00] [Batch: 0001/0684] L: 0.68586 Avg L: 0.68586 mIoU: 0.04 | FB-IoU: 37.67
[Epoch: 00] [Batch: 0051/0684] L: 0.40346 Avg L: 0.52482 mIoU: 0.00 | FB-IoU: 38.23
[Epoch: 00] [Batch: 0101/0684] L: 0.37764 Avg L: 0.46712 mIoU: 12.09 | FB-IoU: 45.07
[Epoch: 00] [Batch: 0151/0684] L: 0.46177 Avg L: 0.44117 mIoU: 21.00 | FB-IoU: 50.16
[Epoch: 00] [Batch: 0201/0684] L: 0.46565 Avg L: 0.42391 mIoU: 25.59 | FB-IoU: 53.12
[Epoch: 00] [Batch: 0251/0684] L: 0.39822 Avg L: 0.41749 mIoU: 28.23 | FB-IoU: 54.64
[Epoch: 00] [Batch: 0301/0684] L: 0.30663 Avg L: 0.41056 mIoU: 30.69 | FB-IoU: 55.98
[Epoch: 00] [Batch: 0351/0684] L: 0.31297 Avg L: 0.39998 mIoU: 32.92 | FB-IoU: 57.27
[Epoch: 00] [Batch: 0401/0684] L: 0.40371 Avg L: 0.39329 mIoU: 34.65 | FB-IoU: 58.27
[Epoch: 00] [Batch: 0451/0684] L: 0.30678 Avg L: 0.38822 mIoU: 35.85 | FB-IoU: 59.02
[Epoch: 00] [Batch: 0501/0684] L: 0.33376 Avg L: 0.38439 mIoU: 36.85 | FB-IoU: 59.63
[Epoch: 00] [Batch: 0551/0684] L: 0.49764 Avg L: 0.38038 mIoU: 37.92 | FB-IoU: 60.25
[Epoch: 00] [Batch: 0601/0684] L: 0.37989 Avg L: 0.37663 mIoU: 38.79 | FB-IoU: 60.69
[Epoch: 00] [Batch: 0651/0684] L: 0.34453 Avg L: 0.37386 mIoU: 39.91 | FB-IoU: 61.22
*** Training [@epoch 00] Avg L: 0.37309 mIoU: 40.20 FB-IoU: 61.40 ***
Traceback (most recent call last):
  File "train.py", line 97, in <module>
    val_loss, val_miou, val_fb_iou = train(epoch, model, dataloader_val, optimizer, training=False)
  File "train.py", line 27, in train
    for idx, batch in enumerate(dataloader):
  File "/home/jifanfan/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jifanfan/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jifanfan/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jifanfan/anaconda3/envs/torch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jifanfan/PATNet/data/pascal.py", line 32, in __getitem__
    idx %= len(self.img_metadata)  # for testing, as n_images < 1000
ZeroDivisionError: integer division or modulo by zero
I think the reason is that the validation data cannot be loaded. The following code seems to lead to the error (self.fold for the validation data is 4, so img_metadata stays empty):
img_metadata = []
if self.split == 'trn':  # For training, read image-metadata of "the other" folds
    for fold_id in range(self.nfolds):
        if fold_id == self.fold:  # Skip validation fold
            continue
        img_metadata += read_metadata(self.split, fold_id)
elif self.split == 'val':  # For validation, read image-metadata of "current" fold
    if self.fold != 4:
        img_metadata = read_metadata(self.split, self.fold)
else:
    raise Exception('Undefined split %s: ' % self.split)
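For reference, the failure mode can be reproduced in isolation (a minimal standalone sketch, not the repository code): when self.fold == 4, the 'val' branch above leaves img_metadata empty, so the modulo in __getitem__ divides by zero.

# Minimal sketch of the failure mode (hypothetical index value):
img_metadata = []          # what the 'val' branch produces when self.fold == 4
idx = 7                    # any index handed to __getitem__
idx %= len(img_metadata)   # ZeroDivisionError: integer division or modulo by zero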
I am using PyTorch 1.7 + CUDA 11.0.
Thank you for your time.
Yes, you're right. It's because your validation split is empty: the default training dataset is the whole PASCAL dataset, so your validation dataset has to be different from the training dataset. I have updated the train.py file and revised the default setting to make this clearer. The validation set is now the val split of the FSS-1000 dataset, and you can revise it for your own experiment setting :)
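For anyone who hits this before pulling the update, the idea is roughly the following (a sketch only; the FSSDataset.build_dataloader name, the benchmark strings, and the argument names are assumptions modelled on similar few-shot segmentation codebases, not necessarily the exact PATNet API):

# Sketch: train on PASCAL but validate on a different benchmark (here FSS-1000),
# so the validation metadata is never empty.
dataloader_trn = FSSDataset.build_dataloader('pascal', args.bsz, args.nworker, args.fold, 'trn')
dataloader_val = FSSDataset.build_dataloader('fss', args.bsz, args.nworker, args.fold, 'val')

# A defensive check fails early with a clear message instead of a ZeroDivisionError inside __getitem__:
if len(dataloader_val.dataset) == 0:
    raise ValueError('Validation split is empty; use a benchmark/fold that has validation images.')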