'nan' occurs in training?
stoneyang159 opened this issue · 2 comments
stoneyang159 commented
When I tried to train the model in this repo, the loss value was NaN. Is there something wrong with my config file?
Here is my training config:
```yaml
# coding:utf-8
Net:
  net_type: 'ResNet'
  n_class: 3
Data:
  train_dir: "../data"
  val_dir: "../data"
  train_name: '300w_lp_for_rank.txt'
  val_name: 'aflw2000_filename.txt'
  train_type: 'RANK_300W'
  val_type: 'AFLW2000'
  target_size: 224
Train:
  max_epoch: 80
  batch_size: 64
  num_workers: 6
  test_every: 1
  resume: False
  pretrained_path:
  use_bined: False
  use_rank: True
Loss:
  loss_type: 'RANK'
Optimizer:
  mode: 'adam'
  base_lr: 0.0005
  t_max: 10
```
Thanks a lot in advance.
ntalabot commented
A bit late, but I had the same problem and I think I found out why: the last weight vector of the model is not initialized.
In `src/models/resnet.py`, line 31, it is created as:

```python
self.w = nn.Parameter(torch.Tensor(2048, n_class))
```
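As far as I understand it, `torch.Tensor(2048, n_class)` behaves like `torch.empty`: it only allocates memory and never fills it, so the weights start out as whatever bytes happen to be there, which can be huge or even non-finite values that then blow up the loss. A quick standalone check (not from the repo, just for illustration):

```python
import torch

# torch.Tensor(rows, cols) does NOT initialize values; like torch.empty,
# it reuses whatever data was already in the allocated memory.
w = torch.Tensor(2048, 3)
print(w.abs().max())            # can be arbitrarily large
print(torch.isfinite(w).all())  # not guaranteed to be True
```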
Initializing it like `torch.nn.Linear` weights:

```python
self.w = torch.empty(2048, n_class)
nn.init.uniform_(self.w, -math.sqrt(1 / 2048), math.sqrt(1 / 2048))
self.w = nn.Parameter(self.w)
```

(with `import math` added at the top of the file)
solved the problem for me.
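For anyone hitting this later, here is a minimal self-contained sketch of the idea. This is not the repo's actual model, just a stand-in for the final layer with names of my own choosing; the uniform bound `sqrt(1/fan_in)` matches what `nn.Linear` uses by default:

```python
import math
import torch
import torch.nn as nn

class Head(nn.Module):
    """Bare weight matrix applied to 2048-d features, initialized the
    same way nn.Linear initializes its weight."""
    def __init__(self, n_class: int, in_features: int = 2048):
        super().__init__()
        w = torch.empty(in_features, n_class)
        bound = math.sqrt(1.0 / in_features)   # nn.Linear's default bound
        nn.init.uniform_(w, -bound, bound)
        self.w = nn.Parameter(w)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats @ self.w

head = Head(n_class=3)
logits = head(torch.randn(4, 2048))
assert torch.isfinite(logits).all()  # finite from the first forward pass
```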
stoneyang159 commented
@ntalabot thanks a lot!