RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
I downloaded the code and dataset and modified only anet.yaml, but I still get this error. Can you help me?
My environment and configuration:
torch 1.10.1
torchfile 0.1.0
torchnet 0.0.4
torchvision 0.11.2
dataset:
  num_classes: 200
  split: 75
  training:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/train/'
  testing:
    video_info_path: "./data/activitynet_annotations/video_info_new.csv"
    video_anno_path: "./data/activitynet_annotations/anet_anno_action.json"
    num_frame: 5
    output_path: './path/to/test/'
model:
  embedding_head: 4
  # feat_dim: 2048
  feat_dim: 512
  temporal_scale: 100
  clip_pretrain: "O" ## K : KInetics , O : openAI
training:
  batch_size: 100
  learning_rate: 0.00004
  weight_decay: 0.02
  max_epoch: 5
  checkpoint_path: './path/to/output/'
  random_seed: 1
  step: 10
  gamma: 0.3
  feature_path: "/disk/sdd/liuyang/ANet_CLIP"
  num_gpu: 1
loss:
  lambda_1: 0.6
  lambda_2: 0.4
fewshot:
  shot: 0 ## > 0 is few-shot ; = 0 is zero-shot
  mode: 1 # 1 : base-training 2 : meta-training 3 : meta-testing 4 : no meta-training/ vanilla few-shot
  trimmed: 0 # 0 : untrimmed 1 : trimmed
  episode: 1000
  num_base: 180
  num_test: 20
  ismulti : 1 # 0 : single-instance 1 : multi-instance
  num_way : 4
  meta_class : 1 # # 1: meta-learn classifier 0: vanilla few-shot w/o meta-learning
  meta_mask : 0 # # 1: meta-learn mask 0: vanilla few-shot w/o meta-learning
  trim_support : 1
  num_context : 20
testing:
  cls_thresh: 0.01
  mask_thresh: [0,0.2,0.4,0.6,0.8]
  class_thresh: [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
  top_k_snip: 10
  top_k: 500
  nms_thresh: 0.6
pretraining:
  video_transformer: "./path/to/ckpt"
  isPretrain : 0 # 0 : Finetune , 1 : Pretrain
  video_path: "/disk/sdd/liuyang/ANet_CLIP222"
  raw_video: "/path/to/raw/video"
  clip_length: 768
  clip_stride: 8
  emb_dim: 512
demo:
  generated_feat_dir: "./path/to/feature"
Detailed error report:
=========using KL Loss=and has temperature and * bz==========
Total Number of Learnable Paramters (in M) : 170.715992
No of Gpus using to Train : 1
Saving all Checkpoints in path : ./path/to/train/
No of videos in train is 6575
Loading train Video Information ...
No of class 150
100% 9649/9649 [00:01<00:00, 6946.91it/s]
No of videos in validation is 1094
Loading validation Video Information ...
No of class 50
100% 4728/4728 [00:00<00:00, 26635.37it/s]
stale_train.py:118: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
with autograd.detect_anomaly():
0 torch.Size([100, 512, 100]) torch.Size([100, 100]) torch.Size([100, 100, 100])
/home/ymy/code/ly/STALE-main/MaskFormer/mask_former/modeling/transformer/position_encoding.py:42: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
[W python_anomaly_mode.cpp:104] Warning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "stale_train.py", line 119, in <module>
train(train_loader, model, optimizer, epoch,scheduler)
File "stale_train.py", line 61, in train
loss = stale_loss(top_br_gt,top_br_pred,bottom_br_gt,bottom_br_pred,action_gt, mask_pred,bot_gt,cls_pred,label_gt,features,"train")
File "/home/ymy/code/ly/STALE-main/stale_lib/loss_stale.py", line 235, in stale_loss
red_loss = redundancy_loss(gt_action , pred_action, gt_cls, pred_cls, features)
File "/home/ymy/code/ly/STALE-main/stale_lib/loss_stale.py", line 215, in redundancy_loss
sim_loss += (1-cos_sim(top_feat,bot_feat))
File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/distance.py", line 77, in forward
return F.cosine_similarity(x1, x2, self.dim, self.eps)
(function _print_stack)
Traceback (most recent call last):
File "stale_train.py", line 119, in <module>
train(train_loader, model, optimizer, epoch,scheduler)
File "stale_train.py", line 64, in train
tot_loss.backward()
File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ymy/miniconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
I also hit this issue sometimes, while at other times my code runs fine.
I believe this issue is related to the empty initialisation of the background embedding tensor in the STALE model class.
self.bg_embeddings = nn.Parameter( torch.empty(1, 512) )
This means the tensor will have whatever values were already present in that memory block. These could be zeros, random numbers, or even nan or inf values, depending on what operations were performed in that memory space previously. This could also explain why the code works sometimes and not others.
One solution is to initialise the background embedding explicitly, for example:
self.bg_embeddings = nn.Parameter( torch.empty(1, 512) )
torch.nn.init.kaiming_uniform_(self.bg_embeddings, nonlinearity='relu')
However, I have not checked whether there is a specific reason the background embedding is left uninitialised. @sauradip, please could you let us know if this solution would be acceptable?
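For anyone who wants to try the fix in isolation, here is a minimal sketch. `TinyModel` is a hypothetical stand-in, not the real STALE class; only the `bg_embeddings` line is taken from the code quoted in this thread. It allocates the parameter with `torch.empty`, overwrites it in place, and checks that every value is finite. Kaiming uniform is just one choice; any initialiser that fills the whole tensor should serve the same purpose.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the STALE model class."""

    def __init__(self, emb_dim: int = 512):
        super().__init__()
        # torch.empty only allocates memory; its contents are arbitrary.
        self.bg_embeddings = nn.Parameter(torch.empty(1, emb_dim))
        # Overwrite the arbitrary contents in place with a defined init.
        nn.init.kaiming_uniform_(self.bg_embeddings, nonlinearity="relu")


model = TinyModel()
# True when the parameter holds no NaN or Inf values.
all_finite = bool(torch.isfinite(model.bg_embeddings).all())
print(all_finite)
```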
Thanks for your advice, @ed-fish! I have tried this solution over multiple runs and it works. Zero initialization may be the reason for the "DivBackward0" error.
Good spot! You can initialize it with anything. Sorry for this error; I will correct it.
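To check whether a given run is affected before training starts, a small helper along these lines could list any parameters that already contain NaN or Inf values. This is only a sketch: `nonfinite_parameters` is a hypothetical name, not part of STALE or PyTorch, and the `nn.Linear` layer below is a throwaway example rather than the STALE model.

```python
import torch
import torch.nn as nn


def nonfinite_parameters(model: nn.Module) -> list:
    """Return the names of parameters that contain NaN or Inf values."""
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]


# Throwaway layer used only to demonstrate the helper.
layer = nn.Linear(4, 2)
clean = nonfinite_parameters(layer)

with torch.no_grad():
    # Simulate a bad value such as one left over in uninitialised memory.
    layer.bias[0] = float("nan")
broken = nonfinite_parameters(layer)

print(clean, broken)
```

Calling the helper on the real model right after construction would show whether `bg_embeddings` (or any other parameter) started out with non-finite values in that particular run.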