'nan' occurs in training?
stoneyang159 opened this issue · 2 comments
stoneyang159 commented
When I tried to train the model in this repo, the loss value was NaN. Is there something wrong with my config file?
Here is my training config:
```yaml
# coding:utf-8
Net:
  net_type: 'ResNet'
  n_class: 3
Data:
  train_dir: "../data"
  val_dir: "../data"
  train_name: '300w_lp_for_rank.txt'
  val_name: 'aflw2000_filename.txt'
  train_type: 'RANK_300W'
  val_type: 'AFLW2000'
  target_size: 224
Train:
  max_epoch: 80
  batch_size: 64
  num_workers: 6
  test_every: 1
  resume: False
  pretrained_path:
  use_bined: False
  use_rank: True
Loss:
  loss_type: 'RANK'
Optimizer:
  mode: 'adam'
  base_lr: 0.0005
  t_max: 10
```
Thanks a lot in advance.
ntalabot commented
A bit late, but I had the same problem and I think I found out why: the last weight vector of the model is not initialized.
In `src/models/resnet.py`, line 31, it is created as:

```python
self.w = nn.Parameter(torch.Tensor(2048, n_class))
```
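As far as I understand it, `torch.Tensor(2048, n_class)` behaves like `torch.empty`: it only allocates memory and never fills it, so the weights start out as whatever bytes happen to be there, which can be huge or even non-finite values that then blow up the loss. A quick standalone check (not from the repo, just for illustration):

```python
import torch

# torch.Tensor(rows, cols) does NOT initialize values; like torch.empty,
# it reuses whatever data was already in the allocated memory.
w = torch.Tensor(2048, 3)
print(w.abs().max())            # can be arbitrarily large
print(torch.isfinite(w).all())  # not guaranteed to be True
```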
Initializing it like `torch.nn.Linear` weights:

```python
self.w = torch.empty(2048, n_class)
nn.init.uniform_(self.w, -math.sqrt(1 / 2048), math.sqrt(1 / 2048))
self.w = nn.Parameter(self.w)
```

(with `import math` added at the top of the file)
solved the problem for me.
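For anyone hitting this later, here is a minimal self-contained sketch of the idea. This is not the repo's actual model, just a stand-in for the final layer with names of my own choosing; the uniform bound `sqrt(1/fan_in)` matches what `nn.Linear` uses by default:

```python
import math
import torch
import torch.nn as nn

class Head(nn.Module):
    """Bare weight matrix applied to 2048-d features, initialized the
    same way nn.Linear initializes its weight."""
    def __init__(self, n_class: int, in_features: int = 2048):
        super().__init__()
        w = torch.empty(in_features, n_class)
        bound = math.sqrt(1.0 / in_features)   # nn.Linear's default bound
        nn.init.uniform_(w, -bound, bound)
        self.w = nn.Parameter(w)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats @ self.w

head = Head(n_class=3)
logits = head(torch.randn(4, 2048))
assert torch.isfinite(logits).all()  # finite from the first forward pass
```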
stoneyang159 commented
@ntalabot thanks a lot!