Raschka-research-group/coral-cnn

While training my model I'm facing an issue.

NarasimmanSaravana1994 opened this issue · 5 comments

Epoch: 001/200 | Batch 0000/20149 | Cost: 70.1415
Epoch: 001/200 | Batch 0050/20149 | Cost: 59.7190
Epoch: 001/200 | Batch 0100/20149 | Cost: 56.4751
Epoch: 001/200 | Batch 0150/20149 | Cost: 58.4821
Epoch: 001/200 | Batch 0200/20149 | Cost: 56.8452
Epoch: 001/200 | Batch 0250/20149 | Cost: 59.0936
Epoch: 001/200 | Batch 0300/20149 | Cost: 54.9184
Epoch: 001/200 | Batch 0350/20149 | Cost: 53.4635
Epoch: 001/200 | Batch 0400/20149 | Cost: 52.2409
Epoch: 001/200 | Batch 0450/20149 | Cost: 51.1332
Epoch: 001/200 | Batch 0500/20149 | Cost: 57.5054
Epoch: 001/200 | Batch 0550/20149 | Cost: 53.7109
Epoch: 001/200 | Batch 0600/20149 | Cost: 58.1618
Epoch: 001/200 | Batch 0650/20149 | Cost: 53.6513
Epoch: 001/200 | Batch 0700/20149 | Cost: 55.9161
Epoch: 001/200 | Batch 0750/20149 | Cost: 55.2700
Epoch: 001/200 | Batch 0800/20149 | Cost: 52.1431
Epoch: 001/200 | Batch 0850/20149 | Cost: 54.5851
Epoch: 001/200 | Batch 0900/20149 | Cost: 62.3357
Epoch: 001/200 | Batch 0950/20149 | Cost: 53.9224
Epoch: 001/200 | Batch 1000/20149 | Cost: 57.4987
Epoch: 001/200 | Batch 1050/20149 | Cost: 59.1612
Epoch: 001/200 | Batch 1100/20149 | Cost: 52.0190
Epoch: 001/200 | Batch 1150/20149 | Cost: 59.5060
Epoch: 001/200 | Batch 1200/20149 | Cost: 57.0917
Epoch: 001/200 | Batch 1250/20149 | Cost: 53.7502
Epoch: 001/200 | Batch 1300/20149 | Cost: 62.6665
Epoch: 001/200 | Batch 1350/20149 | Cost: 50.6539
Epoch: 001/200 | Batch 1400/20149 | Cost: 51.1941
Traceback (most recent call last):
File "afad-coral.py", line 379, in <module>
for batch_idx, (features, targets, levels) in enumerate(train_loader):
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
return self._process_data(data)
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in default_collate
return [default_collate(samples) for samples in transposed]
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "/home/administrator/gender_identification/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 56, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 98 and 99 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:689

I hit this error while training the model on my custom dataset. Is there any way to identify the file that causes it?

@rasbt @yienxu Please guide me.

rasbt commented

I don't think this issue is related to CORAL; you may have some issues with your dataset because it cannot complete the first epoch. I suggest you iterate over your custom dataset and check the tensor sizes of features and targets to see if they are inconsistent somewhere.

ahm7 commented

I am facing the same error.
Did you solve it?

I think you need to consider that the ages in your dataset probably don't start from 0. So when you create the levels, try using:
label = self.age[index] - k
levels = [1]*label + [0]*(NUM_CLASSES - 1 - label)
where k is the minimum age in your dataset. Otherwise NUM_CLASSES - 1 - label becomes negative for the last k ages, the [0] part comes out empty, and levels ends up longer than NUM_CLASSES - 1, so you get vectors/tensors with mismatched dimensions.
I was facing the same error and this was the issue. I am not sure if you're having the same issue but you can try this.
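To make the length mismatch concrete, here is a small standalone sketch of the level encoding described above. `make_levels` is a hypothetical helper, not a function from the repository, and the numbers (ages 20 to 50, hence 31 classes) are only an illustration.

```python
def make_levels(age, min_age, num_classes):
    """Build the binary level vector for one age, shifted so min_age maps to 0."""
    label = age - min_age
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes - 1}]")
    # Always num_classes - 1 entries: `label` ones followed by zeros.
    return [1] * label + [0] * (num_classes - 1 - label)


NUM_CLASSES = 31  # assumed: ages 20..50 inclusive

# With the shift, every vector has the same length.
assert len(make_levels(20, 20, NUM_CLASSES)) == NUM_CLASSES - 1
assert len(make_levels(50, 20, NUM_CLASSES)) == NUM_CLASSES - 1

# Without the shift, age 50 gives [1]*50 + [0]*(-20); the second part is
# empty, so the vector has 50 entries instead of 30.
unshifted = [1] * 50 + [0] * (NUM_CLASSES - 1 - 50)
print(len(unshifted))
```

Raising an error for out-of-range labels is a design choice in this sketch; it surfaces a bad label when the sample is built instead of later inside the collate function.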

rasbt commented

Good point, there was a related issue here #22

I.e., make sure that the labels start at 0 by subtracting "min(age)" from all labels during training. Then, to make predictions, just add "min(age)" back to the predicted label.

For example, if you have ages between 20-50, subtract "20" from all training examples. Then, if you predict on new data and the model predicts 5, then the "real" label is 5+20 = 25.
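The same arithmetic as a short sketch; `shift_labels` and `restore_age` are hypothetical names used only for illustration. The offset should be computed once from the training data and reused at prediction time.

```python
def shift_labels(ages):
    """Return (zero-based labels, offset) so that the smallest age maps to 0."""
    offset = min(ages)
    return [age - offset for age in ages], offset


def restore_age(predicted_label, offset):
    """Map a zero-based predicted label back to an age."""
    return predicted_label + offset


train_ages = [20, 35, 50]
labels, offset = shift_labels(train_ages)
print(labels, offset)
print(restore_age(5, offset))
```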

I should note that having labels starting at 0 is not only a requirement for CORAL but for regular classification (cross entropy loss) as well -- here, it's due to how PyTorch internally considers the one-hot targets of the class labels when computing the cross entropy loss.

The issue was resolved after I updated the code locally.