Question about Loss is infinite or NaN
Opened this issue · 6 comments
Dear author, I am an entry-level novice and I have a question I would like to ask you. I modified the .sh file to run on oxford_flowers, but after I started training, the following error was reported. I am looking forward to your reply.
Traceback (most recent call last):
File "train.py", line 238, in
main(args)
File "train.py", line 165, in main
trainer.train()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 393, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 256, in train
self.run_epoch()
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 603, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\PycharmProjects\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 308, in model_backward_and_update
self.model_backward(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 297, in model_backward
self.detect_anomaly(loss)
File "d:\pycharmprojects\coop\coop-main\dassl.pytorch\dassl\engine\trainer.py", line 229, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!
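For context, the exception comes from a finiteness check on the loss inside the Dassl trainer rather than from PyTorch itself; a minimal sketch of that kind of check (not the exact source) looks like this:

import torch

def detect_anomaly(loss):
    # Stop training as soon as the scalar loss is no longer a finite number,
    # instead of letting the weights silently diverge.
    if not torch.isfinite(loss).all():
        raise FloatingPointError("Loss is infinite or NaN!")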
This is base2new_train_flowers.sh.
#!/bin/bash
# custom config
DATA=DATA
TRAINER=TCP
WEIGHT=1.0
CFG=vit_b16_ep100_ctxv1
CTP=end # class token position (end or middle)
NCTX=4 # number of context tokens
SHOTS=16 # number of shots (1, 2, 4, 8, 16)
CSC=False # class-specific context (False or True)
FOLDER=output_flowers
for SEED in 1 2 3
do
    DIR=${FOLDER}_${NCTX}/base2new/train_base/oxford_flowers/shots_${SHOTS}_${WEIGHT}/${TRAINER}/${CFG}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Results are available in ${DIR}. Skip this job"
    else
        echo "Run this job and save the output to ${DIR}"
        set CUDA_VISIBLE_DEVICES=0
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/oxford_flowers.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.W ${WEIGHT} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS} \
        DATASET.SUBSAMPLE_CLASSES base
    fi
done
Did you adjust the eps in the optimizer?
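For reference, torch.optim.Adam takes an eps argument (default 1e-8), and the fix discussed here is to pass a larger value wherever the optimizer for the prompt learner is built. A minimal sketch with placeholder names, not the exact code of this repo:

import torch
import torch.nn as nn

# Placeholder module standing in for the prompt learner; the real optimizer
# is built by Dassl from the OPTIM section of the config.
prompt_learner = nn.Linear(512, 4)

optimizer = torch.optim.Adam(
    prompt_learner.parameters(),
    lr=2e-3,     # placeholder learning rate
    eps=1e-3,    # raised from the default 1e-8 to avoid NaN updates
)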
Yes, I have adjusted eps to 1e-3.
In my experiments, the NaN loss was always caused by the Adam optimizer. Can you provide the log file?
I ran into the same problem even though I modified the eps. The log file has not been generated.
Traceback (most recent call last):
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 238, in
main(args)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\train.py", line 165, in main
trainer.train()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 386, in train
super().train(self.start_epoch, self.max_epoch)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 250, in train
self.run_epoch()
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 597, in run_epoch
loss_summary = self.forward_backward(batch)
File "D:\project\deeplearning\Textual-based_Class-aware_prompt_tuning-main\trainers\tcp.py", line 328, in forward_backward
self.model_backward_and_update(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 302, in model_backward_and_update
self.model_backward(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 291, in model_backward
self.detect_anomaly(loss)
File "d:\project\deeplearning\dassl.pytorch-master\dassl\engine\trainer.py", line 223, in detect_anomaly
raise FloatingPointError("Loss is infinite or NaN!")
FloatingPointError: Loss is infinite or NaN!
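As a quick sanity check, you can print the eps actually stored in the running optimizer's param groups (illustrative snippet; optimizer stands for whatever optimizer object the trainer builds), since an edit in a code path that never executes will not show up here:

# Drop this in right after the optimizer is built; each Adam param group
# records its hyperparameters, so expect 1e-3 if the override is applied.
for group in optimizer.param_groups:
    print(group["eps"])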
I found that line 81 and the block around it were never executed, so I added eps=1e-3 below line 88, and now training runs through.