sauradip/STALE

RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.


We are working on a fix. In the current version this bug exists: it appears in roughly 1 out of 5 runs and can occur at most 2 times in a row.

A simple workaround is to re-run the experiment (the error only appears during training).

Solution: put .float() after the CLIP model initialization. This error is caused by mixed-precision training.
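
For reference, a minimal sketch of the fix, assuming the OpenAI clip package and a ViT-B/16 backbone (the actual backbone and variable names used in STALE may differ):

import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# clip.load() keeps the weights in fp16 when loading on a GPU; casting the
# model back to fp32 with .float() avoids the mixed-precision NaN in DivBackward0.
clip_model, preprocess = clip.load("ViT-B/16", device=device)
clip_model = clip_model.float()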

The NaN error is raised by the with autograd.detect_anomaly(): context manager, which checks each backward operation for NaN values.
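
For context, detect_anomaly() is the PyTorch context manager that raises this RuntimeError as soon as a backward op produces NaN. A minimal sketch of how it wraps a training step (the model and loss below are placeholders, not STALE's actual training loop):

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x, target = torch.randn(8, 4), torch.randn(8, 1)

# detect_anomaly() tracks every backward op and raises an error such as
# "Function 'DivBackward0' returned nan values in its 0th output"
# at the first op whose gradients contain NaN.
with torch.autograd.detect_anomaly():
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
optimizer.step()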

I downloaded the code and dataset and modified only anet.yaml, but I still have this problem. Can you help me?

My environment and configuration:

torch                    1.10.1
torchfile                0.1.0
torchnet                 0.0.4
torchvision              0.11.2
dataset:
  num_classes: 200
  split: 75
  training:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/train/'

  testing:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/test/'

model:
  embedding_head: 4
  # feat_dim: 2048
  feat_dim: 512
  temporal_scale: 100
  clip_pretrain: "O" ## K : Kinetics , O : OpenAI

training:
  batch_size: 100
  learning_rate: 0.00004
  weight_decay: 0.02
  max_epoch: 5
  checkpoint_path: './path/to/output/'
  random_seed: 1
  step: 10
  gamma: 0.3
  feature_path: "/disk/sdd/liuyang/ANet_CLIP"
  num_gpu: 1

loss:
  lambda_1: 0.6
  lambda_2: 0.4

fewshot:
  shot: 0 ## > 0 is few-shot ;  = 0 is zero-shot
  mode: 1 # 1 : base-training 2 : meta-training 3 : meta-testing 4 : no meta-training/ vanilla few-shot
  trimmed: 0 # 0 : untrimmed 1 : trimmed
  episode: 1000
  num_base: 180
  num_test: 20
  ismulti : 1 # 0 : single-instance 1 : multi-instance
  num_way : 4
  meta_class : 1 # 1: meta-learn classifier 0: vanilla few-shot w/o meta-learning
  meta_mask : 0 # 1: meta-learn mask 0: vanilla few-shot w/o meta-learning
  trim_support : 1
  num_context : 20

testing:
  cls_thresh: 0.01
  mask_thresh: [0,0.2,0.4,0.6,0.8]
  class_thresh: [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
  top_k_snip: 10
  top_k: 500
  nms_thresh: 0.6

pretraining:
  video_transformer: "./path/to/ckpt"
  isPretrain : 0 # 0 : Finetune , 1 : Pretrain
  video_path: "/disk/sdd/liuyang/ANet_CLIP222"
  raw_video: "/path/to/raw/video"
  clip_length: 768
  clip_stride: 8
  emb_dim: 512

demo:
  generated_feat_dir: "./path/to/feature"