fab-jul/imgcomp-cvpr

Training stops at "-STARTING TRAINING-------------"

qbzhu2020 opened this issue · 8 comments

I am sorry for bothering you again, but please allow me to show my issue one last time. After I prepared the whole environment, including the Python packages and the TFRecords, my training always stops at the string "-STARTING TRAINING-------------". After that it shows no information at all; it just sits there and never finishes. I don't know why. Here is my training command:

python train.py ae_configs/cvpr/low pc_configs/cvpr/res_shallow --restore
"/public/home/xqqstu/fab/code/ckpts/0515_1103 cvpr@low cvpr@res_shallow/ckpts"

maybe an issue with the GPU? do you have one in the system?

I checked it again and found that I do have one! We can see it in the log:

2020-06-12 13:40:33.870168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 31.75GiB freeMemory: 31.03GiB

(Screenshot: the place where the code stops.)

ok. what’s the training data?

and are you running this on your local machine or on some cloud / cluster?

Thank you for your reply. Well, the dataset is ImageNet, and I am running the code on CentOS 7 on the cluster. This morning I tried running the code on the Windows system of my laptop, and unexpectedly it succeeded. So I am very confused about why it didn't work on the cluster. Maybe some configurations on the cluster were wrong.

hm one issue could be that you don't have enough RAM on the cluster. did you check this? you probably want at least 40GB
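To check this on a Linux cluster node, one quick option is to parse /proc/meminfo. This is a minimal sketch (Linux-specific; the 40 GB threshold comes from the comment above, and the file layout is assumed to be the standard "Key: value kB" format):

```python
# Report total and available RAM by parsing /proc/meminfo (Linux-specific).
# Values in /proc/meminfo are reported in kB.
def read_meminfo(path="/proc/meminfo"):
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # amount in kB
    return info

if __name__ == "__main__":
    mem = read_meminfo()
    total_gb = mem["MemTotal"] / (1024 ** 2)
    # MemAvailable is a better estimate than MemFree on modern kernels.
    avail_gb = mem.get("MemAvailable", mem["MemFree"]) / (1024 ** 2)
    print(f"Total RAM:     {total_gb:.1f} GiB")
    print(f"Available RAM: {avail_gb:.1f} GiB")
    if avail_gb < 40:
        print("Warning: less than 40 GiB available; training may stall.")
```

On a shared cluster, note that a job's cgroup limit can be far below the node's physical RAM, so the scheduler's reported allocation is also worth checking.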

Following your advice, I searched on Google and found some suggestions such as reducing the batch size. I changed the batch_size of your model from 30 to 16, reran the training command, and it finally works!! Thank you very much!
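For anyone landing here later: memory use scales linearly with batch size, which is why this helped. A rough back-of-the-envelope sketch (the 256×256 crop size is a hypothetical example, not taken from this repo's configs; real usage is much higher once activations and gradients are counted):

```python
def batch_memory_mb(batch_size, height=256, width=256, channels=3, dtype_bytes=4):
    """Raw memory (MiB) for one float32 image batch. Activations,
    gradients, and optimizer state add a large multiple on top."""
    return batch_size * height * width * channels * dtype_bytes / (1024 ** 2)

print(batch_memory_mb(30))  # original batch size -> 22.5
print(batch_memory_mb(16))  # reduced batch size  -> 12.0
```

The input tensor itself is small; the point is that every intermediate feature map in the network scales by the same factor, so dropping the batch size from 30 to 16 roughly halves peak memory.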