Ivan-Tang-3D/ViewRefer3D

RuntimeError: CUDA error: invalid argument

Morgansgun opened this issue · 6 comments

Hello again! Thanks for your earlier reply; I finally finished the training process. However, when I recently tried to run test.sh, I ran into new errors.
I ran test.sh as your README describes. The file is:

SR3D_GPT='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/Sr3D_release.csv'
PATH_OF_SCANNET_FILE='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/scanresult/keep_all_points_with_global_scan_alignment/keep_all_points_with_global_scan_alignment.pkl'
PATH_OF_REFERIT3D_FILE=${SR3D_GPT}
PATH_OF_BERT='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/bert'

VIEW_NUM=4
EPOCH=100
DATA_NAME=SR3D
EXT=ViewRefer_test
DECODER=4
NAME=${DATA_NAME}_${VIEW_NUM}view_${EPOCH}ep_${EXT}
TRAIN_FILE=train_referit3d

TYPE=reserved
python -u /home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/scripts/${TRAIN_FILE}.py \
--mode evaluate \
-scannet-file ${PATH_OF_SCANNET_FILE} \
-referit3D-file ${PATH_OF_REFERIT3D_FILE} \
--bert-pretrain-path ${PATH_OF_BERT} \
--log-dir logs/results/${NAME} \
--resume-path '/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/logs/results/SR3D_4view_100ep_ViewRefer/12-07-2023-11-19-59/checkpoints/best_model.pth' \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 6 \
--n-workers 4 \
--max-train-epochs ${EPOCH} \
--encoder-layer-num 3 \
--decoder-layer-num ${DECODER} \
--decoder-nhead-num 8 \
--view_number ${VIEW_NUM} \
--rotate_number 4 \
--label-lang-sup True > ./logs/results/${NAME}.log 2>&1 &

The script runs for a while, then breaks at the same place every time:
100%|█████████▉| 1476/1478 [04:02<00:00, 6.34it/s]
100%|█████████▉| 1477/1478 [04:02<00:00, 6.32it/s]
100%|██████████| 1478/1478 [04:02<00:00, 7.01it/s]
100%|██████████| 1478/1478 [04:02<00:00, 6.09it/s]

0%| | 0/1478 [00:00<?, ?it/s]
0%| | 0/1478 [00:01<?, ?it/s]

The error is:
Traceback (most recent call last):
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/scripts/train_referit3d.py", line 291, in <module>
args, out_file=out_file,tokenizer=tokenizer)
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/analysis/deepnet_predictions.py", line 42, in analyze_predictions
net_stats = detailed_predictions_on_dataset(model, d_loader, args=args, device=device, FOR_VISUALIZATION=True,tokenizer=tokenizer)
File "/home/sd/anaconda3/envs/viewrefer/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/models/referit3d_net_utils.py", line 205, in detailed_predictions_on_dataset
batch[k] = batch[k].to(device)
RuntimeError: CUDA error: invalid argument

Why does training run smoothly while the error occurs only during testing?

The reason is that the analyze_predictions function in deepnet_predictions.py is called only during testing, not during training. My advice is to set an ipdb breakpoint at line 205 of referit3d_net_utils.py. I have been busy recently, so I will look for the failing case over the next few days.
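
As a complement to a breakpoint, the failing entry can be made to identify itself. The following is a minimal sketch, assuming (as the traceback line `batch[k] = batch[k].to(device)` suggests) that `batch` is a dictionary whose values are mostly tensors. The helper name `move_batch_to_device` is hypothetical and not part of the repository.

```python
def move_batch_to_device(batch, device):
    """Move every tensor-like value in `batch` to `device`.

    If a move fails, re-raise with the offending key, dtype, and shape
    so the problematic entry is visible in the traceback.
    """
    for key in list(batch.keys()):
        value = batch[key]
        if not hasattr(value, "to"):
            # Entries without a .to() method (e.g. lists of strings) stay as they are.
            continue
        try:
            batch[key] = value.to(device)
        except RuntimeError as err:
            details = "type={}, dtype={}, shape={}".format(
                type(value).__name__,
                getattr(value, "dtype", None),
                tuple(getattr(value, "shape", ())),
            )
            raise RuntimeError(f"moving batch[{key!r}] failed ({details})") from err
    return batch
```

Swapping this in for the loop around line 205 would be one way to see which key triggers the error. Because CUDA errors can be reported asynchronously, launching the script with the environment variable CUDA_LAUNCH_BLOCKING=1 may also help confirm that the reported line is really where the failure happens.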

Thanks for your reply! I will try your suggestion and see whether it works.
I also found another problem: line 14 of "prepare_referential_data.py" has "from referit3d.in_out.sr3d import load_sr3d_raw_data", but I can't find a definition of load_sr3d_raw_data in sr3d.py, so the import fails.
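
A quick way to confirm this kind of problem is to check whether the module actually exposes the name before anything tries to import it. The sketch below uses only the standard library; the helper name `module_exports` is hypothetical.

```python
import importlib


def module_exports(module_name, attr_name):
    """Return True if `module_name` can be imported and defines `attr_name`."""
    module = importlib.import_module(module_name)
    return hasattr(module, attr_name)
```

Run from the repository root, `module_exports("referit3d.in_out.sr3d", "load_sr3d_raw_data")` should return False while sr3d.py lacks the function and True once the correct file is in place.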

Sorry, I pushed the wrong file; its content was that of three_d_object.py. You can refer to this link: https://github.com/sega-hsj/MVT-3DVG/blob/main/referit3d/in_out/sr3d.py

I have revised the content of sr3d.py

OK! But I haven't found a way to solve the first error yet. I retrained a model, but that didn't help.

It might be related to the key k in batch[k].
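
To follow up on that idea, it may help to list what each key in the batch holds before anything is moved to the GPU. This is only a sketch under the assumption that `batch` is a dictionary of tensors, arrays, and plain Python values. The helper name `describe_batch` is hypothetical, and the zero-length-dimension flag is just one property that might be worth checking, not a confirmed cause.

```python
def describe_batch(batch):
    """Return one summary row per entry of a batch dictionary."""
    rows = []
    for key, value in batch.items():
        shape = getattr(value, "shape", None)
        shape = tuple(shape) if shape is not None else None
        rows.append({
            "key": key,
            "type": type(value).__name__,
            "dtype": str(getattr(value, "dtype", "n/a")),
            "shape": shape,
            # Flag entries that have a zero-length dimension for a closer look.
            "has_empty_dim": shape is not None and 0 in shape,
        })
    return rows
```

Printing the rows (for example `for row in describe_batch(batch): print(row)`) just before the transfer at line 205 of referit3d_net_utils.py would show whether some key differs between the first pass, which completes, and the pass that fails at 0/1478.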