endernewton/tf-faster-rcnn

Advice on Multi-GPU support?

RichardKov opened this issue · 10 comments

Hi Ender, thanks for your work!

There have been some requests for multi-GPU support (e.g. #51). I am now trying to write a multi-GPU version based on your code.

However, after looking into the code, it seems that the current structure does not support multi-GPU well. For example, if I modify train_val.py in this way:

      with tf.variable_scope(tf.get_variable_scope()):
          for i in range(2):
              with tf.device("/gpu:" + str(i)):
                  with tf.name_scope("tower_" + str(i)) as scope:
                      # Build the main computation graph
                      layers = self.net.create_architecture(sess, 'TRAIN', self.num_classes, tag='default',
                                                            anchor_scales=cfg.ANCHOR_SCALES,
                                                            anchor_ratios=cfg.ANCHOR_RATIOS)
                      # Define the loss
                      loss = layers['total_loss']
                      losses.append(loss)

                      # Share variables across towers
                      tf.get_variable_scope().reuse_variables()

                      # Compute this tower's gradients w.r.t. the loss
                      grads = self.optimizer.compute_gradients(loss)

                      tower_grads.append(grads)
                      scopes.append(scope)
      # Average the gradients across towers
      gvs = self.average_gradients(tower_grads)
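For reference, `average_gradients` here follows the usual tower-gradient reduction (as in the TensorFlow CIFAR-10 multi-GPU example): average each variable's gradient across towers. A minimal NumPy sketch of that reduction (the function name matches the snippet above, but the data and variable names are illustrative):

```python
import numpy as np

def average_gradients(tower_grads):
    """tower_grads: one list of (grad, var_name) pairs per tower,
    with every tower listing the variables in the same order.
    Returns a single list of (averaged_grad, var_name) pairs."""
    averaged = []
    # zip(*tower_grads) groups the i-th (grad, var) pair of every tower.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var_name = grads_and_vars[0][1]
        averaged.append((np.mean(grads, axis=0), var_name))
    return averaged

# Two towers, one weight 'w' and one bias 'b':
tower0 = [(np.array([2.0, 4.0]), 'w'), (np.array([1.0]), 'b')]
tower1 = [(np.array([4.0, 8.0]), 'w'), (np.array([3.0]), 'b')]
gvs = average_gradients([tower0, tower1])
# gvs[0] -> (array([3., 6.]), 'w'); gvs[1] -> (array([2.]), 'b')
```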

It cannot work, because the network class has only one "self.image" placeholder, so an error of

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'tower_0/Placeholder' with dtype float

will be thrown.
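One way around this is to give each tower its own input placeholders instead of sharing the single `self.image`. A rough graph-mode sketch (written against the TF 1.x-style API via `tf.compat.v1`; the placeholder shapes and the commented feed loop are illustrative, not the repo's actual code):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

NUM_GPUS = 2
tower_inputs = []

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
            # Each tower gets its OWN placeholders; reusing one
            # self.image across towers is what triggers the
            # "must feed a value for placeholder" error.
            image = tf.placeholder(tf.float32, [1, None, None, 3])
            im_info = tf.placeholder(tf.float32, [1, 3])
            gt_boxes = tf.placeholder(tf.float32, [None, 5])
            tower_inputs.append((image, im_info, gt_boxes))
            # ... build this tower's network from these tensors and
            # collect layers['total_loss'] and the tower gradients ...
            tf.get_variable_scope().reuse_variables()

# At train time, feed one minibatch per tower in a single step:
# feed = {}
# for (image, im_info, gt_boxes), blob in zip(tower_inputs, blobs):
#     feed[image] = blob['data']
#     feed[im_info] = blob['im_info']
#     feed[gt_boxes] = blob['gt_boxes']
# sess.run(train_op, feed_dict=feed)
```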

Can you give any advice on how to implement a multi-GPU version of this code?

Many thanks.

thanks for the effort! you will first need to dump a dataset to some tfrecord, tf slim has great support for multi-gpu training. i have been trying to do this for a long time but haven't really got into it yet.

I see.
There seems to be a branch that supports tfrecord here: philokey@3297a46
But we can't have summaries on the validation set if we build the network in this way:

      layers = self.net.create_architecture(sess, 'TRAIN', self.imdb.num_classes,
                                            image=image,
                                            im_info=tf.expand_dims(im_shape[1:], dim=0),
                                            gt_boxes=gt_boxes, tag='default',
                                            anchor_scales=cfg.ANCHOR_SCALES,
                                            anchor_ratios=cfg.ANCHOR_RATIOS)

Can you give some suggestions on how to use tf slim to implement a multi-GPU version, based on this branch? It seems tricky because your network is defined in a class...

It seems that py_func doesn't support multi-GPU yet. I tried to use multiple GPUs via slim but failed.

I think py_func may be a bottleneck, but I am not sure whether it supports multi-GPU.

So has anyone implemented a version that supports multi-GPU?

...so why is py_func a bottleneck? What is the matter?

Are your GPUs the same type?

Wow thanks so much @ppwwyyxx! This looks amazing! closing this.

It seems like the errors are caused by the nms() wrapped in tf.py_func. When I changed it to py_nms, the errors were resolved. However, the runtime increased a lot.
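For context, the pure-Python NMS in question can be written in NumPy alone, which avoids `tf.py_func` entirely but, as noted above, is much slower than the CUDA kernel. A sketch of greedy NMS (`py_nms` here is an illustrative implementation, not necessarily the repo's exact function):

```python
import numpy as np

def py_nms(dets, thresh):
    """Greedy non-maximum suppression in pure NumPy.
    dets: (N, 5) array of [x1, y1, x2, y2, score]."""
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes overlapping box i by more than the threshold
        order = order[1:][iou <= thresh]
    return keep

dets = np.array([[0, 0, 10, 10, 0.9],
                 [1, 1, 10, 10, 0.8],   # heavy overlap with box 0
                 [20, 20, 30, 30, 0.7]])
# py_nms(dets, 0.5) keeps boxes 0 and 2, suppresses box 1
```

The Python loop runs once per surviving box, so on the thousands of proposals the RPN produces it is far slower than batched GPU NMS, which matches the slowdown reported above.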