sauradip/STALE

RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.

Closed this issue · 4 comments

I downloaded the code and dataset and modified only anet.yaml, but I still hit this error. Can you help me?

My environment and configuration:

torch                    1.10.1
torchfile                0.1.0
torchnet                 0.0.4
torchvision              0.11.2
dataset:
  num_classes: 200
  split: 75
  training:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/train/'

  testing:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/test/'

model:
  embedding_head: 4
  # feat_dim: 2048
  feat_dim: 512
  temporal_scale: 100
  clip_pretrain: "O" ## K : Kinetics , O : OpenAI

training:
  batch_size: 100
  learning_rate: 0.00004
  weight_decay: 0.02
  max_epoch: 5
  checkpoint_path: './path/to/output/'
  random_seed: 1
  step: 10
  gamma: 0.3
  feature_path: "/disk/sdd/liuyang/ANet_CLIP"
  num_gpu: 1

loss:
  lambda_1: 0.6
  lambda_2: 0.4

fewshot:
  shot: 0 ## > 0 is few-shot ;  = 0 is zero-shot
  mode: 1 # 1 : base-training 2 : meta-training 3 : meta-testing 4 : no meta-training/ vanilla few-shot
  trimmed: 0 # 0 : untrimmed 1 : trimmed
  episode: 1000
  num_base: 180
  num_test: 20
  ismulti : 1 # 0 : single-instance 1 : multi-instance
  num_way : 4
  meta_class : 1 # 1: meta-learn classifier 0: vanilla few-shot w/o meta-learning
  meta_mask : 0 # 1: meta-learn mask 0: vanilla few-shot w/o meta-learning
  trim_support : 1
  num_context : 20

testing:
  cls_thresh: 0.01
  mask_thresh: [0,0.2,0.4,0.6,0.8]
  class_thresh: [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
  top_k_snip: 10
  top_k: 500
  nms_thresh: 0.6

pretraining:
  video_transformer: "./path/to/ckpt"
  isPretrain : 0 # 0 : Finetune , 1 : Pretrain
  video_path: "/disk/sdd/liuyang/ANet_CLIP222"
  raw_video: "/path/to/raw/video"
  clip_length: 768
  clip_stride: 8
  emb_dim: 512

demo:
  generated_feat_dir: "./path/to/feature"

Detailed error output:

=========using KL Loss=and has temperature and * bz==========

Total Number of Learnable Paramters (in M) :  170.715992
No of Gpus using to Train :  1 
 Saving all Checkpoints in path : ./path/to/train/
No of videos in train is 6575
Loading train Video Information ...
No of class 150
100% 9649/9649 [00:01<00:00, 6946.91it/s]
No of videos in validation is 1094
Loading validation Video Information ...
No of class 50
100% 4728/4728 [00:00<00:00, 26635.37it/s]
stale_train.py:118: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with autograd.detect_anomaly():
0 torch.Size([100, 512, 100]) torch.Size([100, 100]) torch.Size([100, 100, 100])
/home/ymy/code/ly/STALE-main/MaskFormer/mask_former/modeling/transformer/position_encoding.py:42: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
[W python_anomaly_mode.cpp:104] Warning: Error detected in DivBackward0. Traceback of forward call that caused the error:
  File "stale_train.py", line 119, in <module>
    train(train_loader, model, optimizer, epoch,scheduler)
  File "stale_train.py", line 61, in train
    loss = stale_loss(top_br_gt,top_br_pred,bottom_br_gt,bottom_br_pred,action_gt, mask_pred,bot_gt,cls_pred,label_gt,features,"train")
  File "/home/ymy/code/ly/STALE-main/stale_lib/loss_stale.py", line 235, in stale_loss
    red_loss = redundancy_loss(gt_action , pred_action, gt_cls, pred_cls, features)
  File "/home/ymy/code/ly/STALE-main/stale_lib/loss_stale.py", line 215, in redundancy_loss
    sim_loss += (1-cos_sim(top_feat,bot_feat))
  File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/distance.py", line 77, in forward
    return F.cosine_similarity(x1, x2, self.dim, self.eps)
 (function _print_stack)
Traceback (most recent call last):
  File "stale_train.py", line 119, in <module>
    train(train_loader, model, optimizer, epoch,scheduler)
  File "stale_train.py", line 64, in train
    tot_loss.backward()
  File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.

I also face this issue sometimes, and sometimes my code runs fine.

I believe this issue is related to the empty initialisation of the background embedding tensor in the STALE model class.

self.bg_embeddings = nn.Parameter( torch.empty(1, 512) )

Because torch.empty allocates memory without initialising it, the tensor holds whatever values were already present in that memory block: zeros, arbitrary numbers, or even NaN or Inf, depending on what previously used that memory. This would also explain why the code works on some runs and not on others.
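
As a minimal standalone sketch (not the STALE code) of the worst case: if the uninitialised memory happens to hold NaN, the 1 - cos_sim term from the traceback and its gradients become NaN, which is exactly what anomaly detection then reports.

import torch
import torch.nn.functional as F

feat = torch.randn(1, 512)

# Simulate the worst case of torch.empty(): memory that already contains NaN.
bad_bg = torch.full((1, 512), float('nan'), requires_grad=True)

# Same kind of term as the redundancy loss in the traceback: 1 - cosine similarity.
loss = (1 - F.cosine_similarity(bad_bg, feat, dim=1)).sum()
loss.backward()

print(loss)                       # nan
print(bad_bg.grad.isnan().any())  # True: the backward pass produces NaN gradients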

One solution is to initialise the background embedding explicitly, for example:

self.bg_embeddings = nn.Parameter(torch.empty(1, 512))
torch.nn.init.kaiming_uniform_(self.bg_embeddings, nonlinearity='relu')
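
To put the fix in context, here is a hedged sketch of how it could sit in the model's __init__. The class name and constructor below are stand-ins for illustration; only the bg_embeddings attribute comes from the repository.

import torch
import torch.nn as nn

class STALEModel(nn.Module):  # hypothetical stand-in for the real STALE model class
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        # Allocate the background embedding, then overwrite the uninitialised
        # memory with defined values so it can never start as NaN/Inf garbage.
        self.bg_embeddings = nn.Parameter(torch.empty(1, emb_dim))
        nn.init.kaiming_uniform_(self.bg_embeddings, nonlinearity='relu')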

However, I have not checked whether there was a specific reason for leaving the background embedding uninitialised. @sauradip, could you please let us know if this solution would be acceptable?


Thanks for your advice @ed-fish! I have tried this solution over multiple runs and found that it works; the missing initialisation does appear to be the reason for the 'DivBackward0' error.
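
For anyone hitting this intermittently, a small sanity check (a hypothetical helper, not part of the repository) can confirm before training that no parameter starts out non-finite:

import torch
import torch.nn as nn

def check_finite_params(model: nn.Module) -> None:
    # Raise if any parameter contains NaN/Inf right after construction.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            raise ValueError(f"parameter '{name}' is non-finite at init")

# Usage sketch with a stand-in module; in practice, call it on the STALE model.
check_finite_params(nn.Linear(512, 512))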


Good spot! You can initialise it with anything. Sorry for this error; I will correct it.