loss: nan when training a custom dataset
rezha130 opened this issue · 3 comments
Hi @BelBES
I have tried several batch sizes (8, 16, 32, 64, 128, 256), but training on my custom dataset always ends with loss: nan in every epoch.
python train.py --data-path datatrain --test-init True --test-epoch 10 --output-dir snapshot --abc 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/. --batch-size 8
Test phase
acc: 0.0000; avg_ed: 0.0000: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 18.10it/s]
acc: 0.0 acc_best: 0; avg_ed: 18.428571428571427
epoch: 0; iter: 1998; lr: 1.0000000000000002e-06; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.69it/s]
epoch: 1; iter: 3998; lr: 1.0000000000000004e-10; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.74it/s]
epoch: 2; iter: 5998; lr: 1.0000000000000006e-14; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:43<00:00, 45.84it/s]
I am using PyTorch 0.4, Python 3.6, a GTX 1080 Ti, and Ubuntu 16.04.
Could you help me figure out how to solve this problem?
Kind regards
Hi,
Can you provide a small reproducer for this bug?
Sorry @BelBES, could you explain what you mean by a "small reproducer"?
FYI, this is the structure of my custom dataset:
datatrain
---- data
-------- folderA/img_filename_0.jpg
...
-------- folderB/img_filename_1.jpg
---- desc.json
And this is the structure of my custom desc.json:
{
    "abc": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/.",
    "train": [
        {
            "text": "text_on_image0",
            "name": "folderA/img_filename_0.jpg"
        },
        ...
        {
            "text": "text_on_image1",
            "name": "folderB/img_filename_1.jpg"
        }
    ],
    "test": [
        {
            "text": "text_on_image3",
            "name": "folderC/img_filename_3.jpg"
        },
        ...
        {
            "text": "text_on_image4",
            "name": "folderD/img_filename_4.jpg"
        }
    ]
}
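For reference, here is a minimal sanity-check sketch (it assumes the datatrain layout above; the script and variable names are mine, not from the repo) that confirms desc.json parses and that every label only uses characters from the "abc" alphabet, since out-of-alphabet characters in labels can cause problems with a CTC-style loss:

import json
import os

data_path = "datatrain"  # same value as --data-path

with open(os.path.join(data_path, "desc.json")) as f:
    desc = json.load(f)  # raises an error here if the JSON is malformed

alphabet = set(desc["abc"])
for split in ("train", "test"):
    for sample in desc[split]:
        # characters in the label that are missing from the "abc" alphabet
        unknown = set(sample["text"]) - alphabet
        if unknown:
            print(sample["name"], "has characters not in abc:", unknown)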
In text_data.py, I used this syntax on line 32:
img = cv2.imread(os.path.join(self.data_path, "data", name))
But I still have the same loss: nan issue. Please help.
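A quick check along the same lines (using the same path construction as the cv2.imread line above; the variable names are illustrative) can catch images that OpenCV fails to read, since cv2.imread returns None instead of raising an error, so missing or unreadable files are easy to miss:

import json
import os
import cv2

data_path = "datatrain"
with open(os.path.join(data_path, "desc.json")) as f:
    desc = json.load(f)

for sample in desc["train"] + desc["test"]:
    # same path construction as in text_data.py
    path = os.path.join(data_path, "data", sample["name"])
    img = cv2.imread(path)
    if img is None:
        print("cv2.imread could not read:", path)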
When I tried to debug with cuda = False (i.e. on CPU) on my dev laptop, this is what the debugger shows for the loss.data[0] that causes loss: nan:
[0]:<Tensor>
_backward_hooks:None
_base:<Tensor, len() = 1>
_cdata:140460563260592
_grad:None
_grad_fn:None
_version:0
data:<Tensor>
device:device(type='cpu')
dtype:torch.float32
grad:None
grad_fn:None
is_cuda:False
is_leaf:True
is_sparse:False
layout:torch.strided
name:None
output_nr:0
Note: I set cuda = False on my CPU dev laptop, but cuda = True on the GPU server above.
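For what it's worth, a small helper like the sketch below can be dropped in right after the loss is computed to stop on the first bad batch so it can be inspected; names like loss_function and iteration are illustrative, not from train.py, and on PyTorch 0.4 loss.item() is the usual way to read a scalar tensor rather than loss.data[0]:

import math

def check_loss(loss, iteration):
    # stop on the first NaN/Inf loss so the offending batch can be inspected
    value = loss.item()  # scalar tensor -> Python float (PyTorch 0.4+)
    if math.isnan(value) or math.isinf(value):
        raise RuntimeError("bad loss {} at iteration {}".format(value, iteration))
    return value

# usage inside the training loop (illustrative names):
# loss = loss_function(preds, labels, preds_lens, labels_lens)
# check_loss(loss, iteration)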