jiwoon-ahn/psa

Learning rate

Opened this issue · 9 comments

DQDH commented

I don't know how to get the parameters of the Ours-ResNet segmentation network. Can you explain the parameters? Thanks.

I tried changing the learning rate to 0.01 and the batch size to 4. The loss decreased to 0.0403 within one epoch (Iter: 37000/39675; the epoch was almost finished but failed), but the program often raises an error like this:
validating ... terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

So, are there any parameters I need to change? Or any advice on the error?

I tried changing the learning rate to 0.01 and the batch size to 4 and, because of limited GPU resources, I set
model = torch.nn.DataParallel(model, device_ids=[0]), but after that there is always an error like this:
Iter:36900/39675 Loss:0.0413 imps:3.5 Fin:Mon Oct 22 03:56:00 2018 lr: 0.0009
Iter:36950/39675 Loss:0.0363 imps:3.5 Fin:Mon Oct 22 03:55:56 2018 lr: 0.0009
Iter:37000/39675 Loss:0.0403 imps:3.5 Fin:Mon Oct 22 03:55:53 2018 lr: 0.0009

validating ... terminate called after throwing an instance of 'std::system_error'
what(): Operation not permitted

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

The program can't finish an epoch, but the loss has decreased to 0.0403. Is that acceptable, or does it need more epochs?
Do you have any advice on this error?
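
The lr: 0.0009 values in the log above are consistent with a polynomial ("poly") decay from the initial rate of 0.01, assuming a decay power of 0.9. That power is a common choice and is assumed here; it is not stated anywhere in this thread. A minimal sketch of the arithmetic, with poly_lr as a hypothetical helper name:

```python
def poly_lr(base_lr, step, max_step, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_step) ** power.

    power=0.9 is an assumption; it is not stated in this thread.
    """
    if not 0 <= step <= max_step:
        raise ValueError("step must be in [0, max_step]")
    return base_lr * (1.0 - step / max_step) ** power


if __name__ == "__main__":
    # Iteration counts taken from the log lines quoted above.
    for step in (36900, 36950, 37000):
        print(f"Iter:{step}/39675 lr: {poly_lr(0.01, step, 39675):.4f}")
```

If this assumption holds, the rate at iteration 37000 is already below a tenth of its initial value, so the remaining iterations of the epoch would change the weights only slightly.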

@hardBird123, I trained with Adam, setting the initial learning rate to 0.001, but I didn't try to find the optimal learning rate. You can get better results than mine by adopting SGD or just following the method described in https://arxiv.org/pdf/1611.10080.pdf.

@LeiyuanMa, Sorry, I can't help you with that error. It is probably related to a memory leak. In my case, the number of training epochs does not change the performance much, and I haven't tested whether training for 15 epochs is best for the network.

DQDH commented

OK, thanks. I want to confirm: is the weights file [ilsvrc-cls_rna-a1_cls1000_ep-0001.params] the pretrained weights for training the ResNet38 segmentation network?

@hardBird123, Yes, that is the right file for the segmentation network.

Thanks. So is a loss of 0.0403 acceptable?

DQDH commented

Which lr_type (fixed (default), step, or linear) should I choose when training the ResNet38 segmentation network?
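
For readers comparing the three options, here is a generic sketch of what schedules with these names commonly do. The option names come from the question above; how the training script actually implements them is not shown in this thread, so the function lr_at and its parameters total_epochs, step_epochs, and gamma are hypothetical.

```python
def lr_at(epoch, base_lr, lr_type="fixed", total_epochs=10,
          step_epochs=(5, 8), gamma=0.1):
    """Return an illustrative learning rate for a given epoch.

    All parameters except lr_type are assumptions made for this sketch.
    """
    if lr_type == "fixed":
        # Constant rate for the whole run.
        return base_lr
    if lr_type == "step":
        # Multiply by gamma once for every milestone already reached.
        drops = sum(1 for e in step_epochs if epoch >= e)
        return base_lr * gamma ** drops
    if lr_type == "linear":
        # Decay linearly from base_lr to zero at total_epochs.
        return base_lr * max(0.0, 1.0 - epoch / total_epochs)
    raise ValueError(f"unknown lr_type: {lr_type!r}")


if __name__ == "__main__":
    for kind in ("fixed", "step", "linear"):
        print(kind, [round(lr_at(e, 0.001, kind), 6) for e in range(10)])
```

A fixed rate is the simplest baseline; step and linear both lower the rate late in training, which often helps the loss settle, though which works best here would have to be checked empirically.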

Hello, I'm a student running this code, and I hit a runtime error. Can you give me some tips about this issue?

RuntimeError: size mismatch, m1: [1 x 20], m2: [1 x 20]
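
Older PyTorch versions print a message of this form when a matrix multiplication receives operands whose inner dimensions differ: here a [1 x 20] matrix multiplied by another [1 x 20] matrix (20 columns against 1 row). Which line of the code triggers it cannot be determined from this thread. A minimal sketch of the shape rule in plain Python, with matmul_shape as a hypothetical helper:

```python
def matmul_shape(m1, m2):
    """Return the result shape of m1 @ m2 for 2-D shapes given as (rows, cols).

    Raises ValueError when the inner dimensions do not match.
    """
    (r1, c1), (r2, c2) = m1, m2
    if c1 != r2:
        raise ValueError(
            f"size mismatch, m1: [{r1} x {c1}], m2: [{r2} x {c2}]"
        )
    return (r1, c2)


if __name__ == "__main__":
    print(matmul_shape((1, 20), (20, 1)))  # compatible: inner dims are both 20
    try:
        matmul_shape((1, 20), (1, 20))     # the shapes from the error above
    except ValueError as err:
        print(err)
```

Depending on the intended operation, transposing one operand so the shapes become [1 x 20] and [20 x 1] may resolve it, but that is only a guess without the full traceback.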