ijkguo/mx-rcnn

Training with multiple GPUs is not faster

zdwong opened this issue · 3 comments

Thanks for your great work porting py-faster-rcnn from Caffe to MXNet. After installing and running mx-rcnn, I found that training on two GPUs is not nearly twice as fast as training on one GPU.
Platform: Ubuntu 16.04; GPU: Tesla M60, 8 GB

bash script/vgg_voc07.sh 0

INFO:root:Epoch[0] Batch [20] Speed: 2.04 samples/sec Train-RPNAcc=0.894159, RPNLogLoss=0.361955, RPNL1Loss=1.139758, RCNNAcc=0.712054, RCNNLogLoss=1.508607, RCNNL1Loss=2.551116,
INFO:root:Epoch[0] Batch [40] Speed: 1.89 samples/sec Train-RPNAcc=0.927401, RPNLogLoss=0.283141, RPNL1Loss=1.018088, RCNNAcc=0.743521, RCNNLogLoss=1.378231, RCNNL1Loss=2.585749,
INFO:root:Epoch[0] Batch [60] Speed: 1.99 samples/sec Train-RPNAcc=0.941726, RPNLogLoss=0.229789, RPNL1Loss=0.936680, RCNNAcc=0.758965, RCNNLogLoss=1.284314, RCNNL1Loss=2.618034,
INFO:root:Epoch[0] Batch [80] Speed: 2.08 samples/sec Train-RPNAcc=0.945939, RPNLogLoss=0.203962, RPNL1Loss=0.934596, RCNNAcc=0.763503, RCNNLogLoss=1.227046, RCNNL1Loss=2.619250,
INFO:root:Epoch[0] Batch [100] Speed: 1.89 samples/sec Train-RPNAcc=0.942644, RPNLogLoss=0.211725, RPNL1Loss=0.920782, RCNNAcc=0.769183, RCNNLogLoss=1.197012, RCNNL1Loss=2.589773,

bash script/vgg_voc07.sh 0,1

INFO:root:Epoch[0] Batch [40] Speed: 2.10 samples/sec Train-RPNAcc=0.934642, RPNLogLoss=0.237217, RPNL1Loss=1.014563, RCNNAcc=0.766673, RCNNLogLoss=1.192775, RCNNL1Loss=2.580673,
INFO:root:Epoch[0] Batch [60] Speed: 2.15 samples/sec Train-RPNAcc=0.942495, RPNLogLoss=0.202506, RPNL1Loss=0.930434, RCNNAcc=0.777600, RCNNLogLoss=1.104864, RCNNL1Loss=2.590131,
INFO:root:Epoch[0] Batch [80] Speed: 2.26 samples/sec Train-RPNAcc=0.948712, RPNLogLoss=0.180862, RPNL1Loss=0.889647, RCNNAcc=0.792101, RCNNLogLoss=1.011266, RCNNL1Loss=2.562042,
INFO:root:Epoch[0] Batch [100] Speed: 2.17 samples/sec Train-RPNAcc=0.955039, RPNLogLoss=0.160886, RPNL1Loss=0.852715, RCNNAcc=0.793162, RCNNLogLoss=0.972027, RCNNL1Loss=2.572651

I wondered whether this problem is caused by the data parallelization, but I found that you said this version has it implemented. So why does this happen? Thanks for your reply.
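For context, my understanding is that the Module API parallelizes over GPUs simply by listing several contexts, roughly like the sketch below. This is only an illustration with a placeholder symbol and a dummy iterator, not the actual mx-rcnn training code, and it assumes a recent MXNet and two visible GPUs:

```python
import mxnet as mx

# Placeholder network; mx-rcnn builds a much larger Faster R-CNN symbol.
data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')
net = mx.sym.FullyConnected(data, num_hidden=10)
net = mx.sym.SoftmaxOutput(net, label, name='softmax')

# Listing several contexts makes the Module split every batch across the
# GPUs (data parallelism); gradients are aggregated through the KVStore.
mod = mx.mod.Module(net, context=[mx.gpu(0), mx.gpu(1)])

# Dummy iterator with batch_size=2, i.e. one sample per GPU, which mirrors
# mx-rcnn feeding one image per GPU.
train_iter = mx.io.NDArrayIter(
    data=mx.nd.random.uniform(shape=(8, 20)),
    label=mx.nd.zeros((8,)),
    batch_size=2)

mod.fit(train_iter, num_epoch=1, optimizer='sgd',
        optimizer_params={'learning_rate': 0.01},
        kvstore='device')
```

So the per-batch GPU work should be split, which is why I am surprised the samples/sec barely changes.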

I checked it carefully and confirmed that, in general, multi-GPU training is faster than single-GPU training, but the speedup depends on the hardware and platform.

I also noticed this issue. But the README says:
3.8 img/s to 6 img/s for 2 GPUs

Most of the time the bottleneck is the custom proposal_target layer or data loading. Check dmlc/gluon-cv for a Gluon implementation.
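If you want to see where the time goes, one rough check (assuming MXNet >= 1.0, which ships mx.profiler; this snippet is only an illustration and not part of the mx-rcnn scripts) is to profile a few training batches and compare the time spent in the custom Python operators against the GPU kernels:

```python
import mxnet as mx

# Dump per-operator timings to a Chrome-trace JSON file; CPU-side custom
# Python ops such as proposal/proposal_target show up as separate entries.
mx.profiler.set_config(profile_all=True,
                       aggregate_stats=True,
                       filename='rcnn_profile.json')

mx.profiler.set_state('run')
# ... run a few training batches here, e.g. a handful of
# forward_backward/update calls on the training module ...
mx.profiler.set_state('stop')

# Print an aggregated table of time per operator; if proposal_target or the
# data pipeline dominates, adding GPUs will not scale the throughput linearly.
print(mx.profiler.dumps())
```

If the serial CPU work dominates, the extra GPU mostly waits, which matches the roughly 2.0 vs 2.2 samples/sec in the logs above.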