bes-dev/crnn-pytorch

loss: nan when training on a custom dataset

rezha130 opened this issue · 3 comments

Hi @BelBES

I tried several batch sizes (8, 16, 32, 64, 128, 256), but training always ends with loss: nan in every epoch when training on my custom dataset.

python train.py --data-path datatrain --test-init True --test-epoch 10 --output-dir snapshot --abc 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/. --batch-size 8

Test phase
acc: 0.0000; avg_ed: 0.0000: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 18.10it/s]
acc: 0.0	acc_best: 0; avg_ed: 18.428571428571427
epoch: 0; iter: 1998; lr: 1.0000000000000002e-06; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.69it/s]
epoch: 1; iter: 3998; lr: 1.0000000000000004e-10; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:42<00:00, 46.74it/s]
epoch: 2; iter: 5998; lr: 1.0000000000000006e-14; loss_mean: nan; loss: nan: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:43<00:00, 45.84it/s]
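
To see which batch first produces a non-finite loss, a guard like the one below could be added right after the loss is computed and before loss.backward(). This is only a sketch: the check_finite helper, its arguments, and its call site are my own additions, not part of the repo's train.py.

import math

def check_finite(loss, batch, iteration):
    # loss is the scalar loss tensor for the current batch; batch is whatever
    # the dataloader returned (used only for printing). Stop at the first
    # nan/inf instead of letting it poison loss_mean for the rest of the epoch.
    value = float(loss.item())  # .item() is available on PyTorch 0.4 scalar tensors
    if math.isnan(value) or math.isinf(value):
        raise RuntimeError("non-finite loss {} at iteration {}; batch: {}".format(
            value, iteration, batch))
    return value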

I am using PyTorch 0.4, Python 3.6, a GTX 1080 Ti, and Ubuntu 16.04.

Can you help me solve this problem?

Kind regards

Hi,

Can you provide a small reproducer for this bug?

Sorry @BelBES, would you please explain what you mean by a "small reproducer"?

FYI, this is the structure of my custom dataset:

datatrain
---- data
-------- folderA/img_filename_0.jpg
...
-------- folderB/img_filename_1.jpg
---- desc.json

And this is the structure of my custom desc.json:

{
  "abc": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:/.",
  "train": [
    {
      "text": "text_on_image0",
      "name": "folderA/img_filename_0.jpg"
    },
    ...
    {
      "text": "text_on_image1",
      "name": "folderB/img_filename_1.jpg"
    }
  ],
  "test": [
    {
      "text": "text_on_image3",
      "name": "folderC/img_filename_3.jpg"
    },
    ...
    {
      "text": "text_on_image4",
      "name": "folderD/img_filename_4.jpg"
    }
  ]
}
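
Two things that can silently produce bad training samples are label characters missing from "abc" and image paths that do not resolve. A small sanity check over desc.json could rule both out; this is just a sketch, assuming desc.json sits directly under datatrain and images under datatrain/data as shown above:

import json
import os

data_path = "datatrain"  # assumption: the same --data-path used in the train command

with open(os.path.join(data_path, "desc.json")) as f:
    desc = json.load(f)

abc = set(desc["abc"])
for split in ("train", "test"):
    for sample in desc[split]:
        # every character of the label must be in the alphabet
        bad_chars = set(sample["text"]) - abc
        if bad_chars:
            print(split, sample["name"], "has characters outside abc:", bad_chars)
        # every referenced image must exist under datatrain/data
        img_path = os.path.join(data_path, "data", sample["name"])
        if not os.path.isfile(img_path):
            print(split, img_path, "is missing")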

In text_data.py, at line 32, I use this code:
img = cv2.imread(os.path.join(self.data_path, "data", name))
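
Note that cv2.imread returns None instead of raising when a file is missing or unreadable, so a failed read can slip through unnoticed. A guarded version of that call would look something like this (a sketch; load_image is my own helper name, not something from text_data.py):

import os
import cv2

def load_image(data_path, name):
    path = os.path.join(data_path, "data", name)
    img = cv2.imread(path)
    if img is None:
        # fail loudly instead of passing a None "image" further down the pipeline
        raise IOError("could not read image: {}".format(path))
    return img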

But I still have the same loss: nan issue. Please help.

When I tried to debug with cuda = False (i.e. on the CPU) on my dev laptop, this is what the loss.data[0] tensor that causes loss: nan looks like:

[0]:<Tensor>
_backward_hooks:None
_base:<Tensor, len() = 1>
_cdata:140460563260592
_grad:None
_grad_fn:None
_version:0
data:<Tensor>
device:device(type='cpu')
dtype:torch.float32
grad:None
grad_fn:None
is_cuda:False
is_leaf:True
is_sparse:False
layout:torch.strided
name:None
output_nr:0

Note: I set cuda = False on my CPU dev laptop, but cuda = True on the GPU server above.
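
One more thing worth checking when a CTC loss turns into nan: a target string that is longer than the number of time steps the network outputs (or an empty target) cannot be aligned and can yield inf/nan. A minimal sketch of that check, assuming the network output has shape (seq_len, batch, num_classes) as is usual for CRNN + CTC; the function and variable names here are mine, not the repo's:

def check_ctc_lengths(preds, texts):
    # preds: network output of shape (seq_len, batch, num_classes)
    # texts: list of label strings for the current batch
    seq_len = preds.size(0)
    for text in texts:
        if len(text) == 0:
            print("empty target string in this batch")
        elif len(text) > seq_len:
            # CTC cannot align a target longer than the output sequence
            print("target '{}' (len {}) exceeds output seq_len {}".format(
                text, len(text), seq_len))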