training issue

Question

Opened this issue 7 years ago · 6 comments

For acceleration, I modified the body net (res50) in the prototxt and added some custom Caffe layers (they work fine and have already passed some tests).
Now I'm trying to train FastMask from scratch on COCO, without the pre-trained res-50 model.
Currently, the COCO data is on an external hard disk.

/home/lee/FastMask/... << git cloned.
/home/lee/caffe ... << my own caffe (this is the one I use; I copied all the files committed by voidrank into it and it compiled successfully).
/media/lee/xxxxxx/coco/... << coco data here.

lee@lee-All-Series:~/FastMask$ python train.py 0 fm-res39
I have already modified train.py so that it knows where my Caffe and the COCO API are:
sys.path.append(os.path.abspath("/home/lee/coco/PythonAPI"))
sys.path.append(os.path.abspath("/home/lee/caffe/python"))
sys.path.append(os.path.abspath("python_layers"))
sys.path.append(os.path.abspath("/home/lee/FastMask"))
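
A minimal sketch of the same path setup that also reports directories that do not exist, so that a wrong path shows up before Caffe is imported. The paths are the ones from this post; `add_existing_paths` is a hypothetical helper name, not something that ships with FastMask.

```python
import os
import sys


def add_existing_paths(paths):
    """Append each existing directory to sys.path; return the missing ones."""
    missing = []
    for p in paths:
        full = os.path.abspath(p)
        if os.path.isdir(full):
            if full not in sys.path:
                sys.path.append(full)
        else:
            missing.append(full)
    return missing


if __name__ == "__main__":
    # Paths copied from the post above; adjust them to the local layout.
    missing = add_existing_paths([
        "/home/lee/coco/PythonAPI",
        "/home/lee/caffe/python",
        "python_layers",
        "/home/lee/FastMask",
    ])
    if missing:
        print("Missing directories:", missing)
```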

Finally, I got the errors below.

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0830 01:31:46.447471 2199 common.cpp:114] Cannot create Cublas handle. Cublas won't be available.
E0830 01:31:46.448586 2199 common.cpp:121] Cannot create Curand generator. Curand won't be available.
F0830 01:31:46.449659 2199 common.cpp:152] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
Aborted (core dumped)

Is this because I put the COCO data on an external disk?
If the data is on an external disk, how do I tell FastMask where it is? I can't find an option for it.
Or can you guess what the cause might be?
Please help me.
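
The log lines above all come from Caffe's common.cpp and mention Cublas, Curand and cudaSuccess, which suggests (the thread does not confirm it) that the failure happens while CUDA is being initialised rather than while reading the COCO data. A minimal sketch for checking GPU visibility independently of Caffe, assuming the NVIDIA driver tools are installed and `nvidia-smi` is on the PATH; `list_visible_gpus` is a hypothetical helper name.

```python
import shutil
import subprocess


def list_visible_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if it is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # One non-empty line per GPU the driver can see.
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    if gpus:
        print("\n".join(gpus))
    else:
        print("No GPU reported by nvidia-smi; check the driver before debugging the data path.")
```

If this reports no GPU, the dataset location is probably not the cause of the crash.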

Answer 1 · 2017-08-29T20:14:41.000Z

Hi, @dedoogong
Maybe you can take a look at this link

Answer 2 · 2017-08-30T08:57:05.000Z

Thank you for the help.
By the way, I now get an out-of-memory error. How much memory do I need? I have two 8 GB GPUs in SLI. Can I use both by setting gpu_id to 0, 1? Or can I reduce the batch size in the prototxt or the config JSON file? Please help me.

Answer 3 · 2017-08-30T21:10:21.000Z

@dedoogong
Unfortunately, this version of pycaffe doesn't support multi-GPU mode.
You probably need 12 GB of GPU memory. Alternatively, you can reduce the input image size from 1200 to 800.
Look at this link
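
As a rough guide to what the 1200 to 800 change buys, here is a small sketch. It assumes the aspect ratio is kept and that activation memory grows roughly in proportion to the number of input pixels; that is a rule of thumb, not a measurement on FastMask. `relative_pixel_count` is a hypothetical helper name.

```python
def relative_pixel_count(old_size, new_size):
    """Ratio of pixel counts when one image side goes from old_size to new_size."""
    return (float(new_size) / float(old_size)) ** 2


ratio = relative_pixel_count(1200, 800)
# (800 / 1200) ** 2 = 4 / 9
print("pixels per image relative to the original: {:.2f}".format(ratio))  # 0.44
```

Under that assumption, the smaller input needs a bit less than half the activation memory per image; weights and other fixed buffers do not shrink.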

Answer 4 · 2017-08-31T14:53:22.000Z

Thank you very much for your kindness. Can you tell me which option I should set for this?
I tried changing the JSON file several times, as below:
"OBJN_BATCH_SIZE": from 64 to 1, 16, 32, and so on
"MASK_BATCH_SIZE": from 64 to 1, 16, 32, and so on
"TEST_SCALE": from 870 to 400, 500, 600, and so on
"SCALE": from 800 to 1, 20, 400, and so on
"MASK_SIZE": from 160 to 1, 16, 32, and so on

but it still failed. Please help me one more time.
Thank you very much.
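
One way to make such experiments easier to track is to change one key at a time and write each variant to its own file. A minimal sketch, assuming the config is a flat JSON object containing the keys listed above; the file names in the usage comment are hypothetical, and the thread does not say which values FastMask actually accepts.

```python
import json


def override_config(src_path, dst_path, overrides):
    """Copy a JSON config to dst_path with some existing keys replaced."""
    with open(src_path) as f:
        cfg = json.load(f)
    # Refuse keys that are not already in the config, to catch typos early.
    unknown = [k for k in overrides if k not in cfg]
    if unknown:
        raise KeyError("keys not present in config: {}".format(unknown))
    cfg.update(overrides)
    with open(dst_path, "w") as f:
        json.dump(cfg, f, indent=4, sort_keys=True)
    return cfg


# Hypothetical usage: lower only the training scale and keep everything else.
# override_config("config.json", "config_scale600.json", {"SCALE": 600})
```

Keeping the original file untouched makes it possible to tell which single change, if any, removes the out-of-memory error.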

Answer 5 · 2017-09-05T08:52:26.000Z

You should see this.
If you still can't solve it, please provide more details about the error. I'm glad to help you.

Answer 6 · 2020-03-04T09:05:46.000Z

@dedoogong Did you work out what these parameters mean? Would you be willing to share that with me?