fab-jul/imgcomp-cvpr

Training stops at "-STARTING TRAINING-------------"

qbzhu2020 opened this issue · 8 comments

I am sorry for bothering you again, but please allow me to show my issue one last time. After I prepared the whole environment, including the Python packages and the TFRecords, my training always stops at the string "-STARTING TRAINING-------------". After that it shows no information at all; it just sits there and never finishes. I don't know why. Here is my training command:

python train.py ae_configs/cvpr/low pc_configs/cvpr/res_shallow --restore
"/public/home/xqqstu/fab/code/ckpts/0515_1103 cvpr@low cvpr@res_shallow/ckpts"

maybe an issue with the GPU? do you have one in the system?

I checked it again and found that I do have one! We can see it in the log:

2020-06-12 13:40:33.870168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 31.75GiB freeMemory: 31.03GiB

(Screenshot: the place where the code stops.)

ok. what’s the training data?

and are you running this on your local machine or on some cloud / cluster?

Thank you for your reply. Well, the dataset is ImageNet, and I am running the code on CentOS 7 on the cluster. This morning I tried running the code on the Windows system of my laptop, and unexpectedly it succeeded. So I am very confused about why it didn't work on the cluster. Maybe some configurations on the cluster were wrong.

hm one issue could be that you don't have enough RAM on the cluster. did you check this? you probably want at least 40GB
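To check this on a Linux cluster node, one quick option is to parse /proc/meminfo. This is a minimal sketch (Linux-specific; the 40 GB threshold comes from the comment above, and the file layout is assumed to be the standard "Key: value kB" format):

```python
# Report total and available RAM by parsing /proc/meminfo (Linux-specific).
# Values in /proc/meminfo are reported in kB.
def read_meminfo(path="/proc/meminfo"):
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # amount in kB
    return info

if __name__ == "__main__":
    mem = read_meminfo()
    total_gb = mem["MemTotal"] / (1024 ** 2)
    # MemAvailable is a better estimate than MemFree on modern kernels.
    avail_gb = mem.get("MemAvailable", mem["MemFree"]) / (1024 ** 2)
    print(f"Total RAM:     {total_gb:.1f} GiB")
    print(f"Available RAM: {avail_gb:.1f} GiB")
    if avail_gb < 40:
        print("Warning: less than 40 GiB available; training may stall.")
```

On a shared cluster, note that a job's cgroup limit can be far below the node's physical RAM, so the scheduler's reported allocation is also worth checking.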

Following your advice, I searched on Google and found some suggestions such as reducing the batch size. I changed the batch_size of your model from 30 to 16, reran the training command, and it finally works!! Thank you very much!
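For anyone landing here later: memory use scales linearly with batch size, which is why this helped. A rough back-of-the-envelope sketch (the 256×256 crop size is a hypothetical example, not taken from this repo's configs; real usage is much higher once activations and gradients are counted):

```python
def batch_memory_mb(batch_size, height=256, width=256, channels=3, dtype_bytes=4):
    """Raw memory (MiB) for one float32 image batch. Activations,
    gradients, and optimizer state add a large multiple on top."""
    return batch_size * height * width * channels * dtype_bytes / (1024 ** 2)

print(batch_memory_mb(30))  # original batch size -> 22.5
print(batch_memory_mb(16))  # reduced batch size  -> 12.0
```

The input tensor itself is small; the point is that every intermediate feature map in the network scales by the same factor, so dropping the batch size from 30 to 16 roughly halves peak memory.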