endernewton/tf-faster-rcnn

Advice on Multi-GPU support?

RichardKov opened this issue · 10 comments

Hi Ender, thanks for your work!

There have been some requests for multi-GPU support (e.g. #51). I am now trying to write a multi-GPU version based on your code.

However, after looking into the code, it seems that the current structure does not support multi-GPU well. For example, if I modify train_val.py in this way:

      with tf.variable_scope(tf.get_variable_scope()):
          for i in range(2):
              with tf.device("/gpu:" + str(i)):
                  with tf.name_scope("tower_" + str(i)) as scope:
                      # Build the main computation graph
                      layers = self.net.create_architecture(sess, 'TRAIN', self.num_classes, tag='default',
                                                            anchor_scales=cfg.ANCHOR_SCALES,
                                                            anchor_ratios=cfg.ANCHOR_RATIOS)
                      # Define the loss
                      loss = layers['total_loss']
                      losses.append(loss)

                      # Share variables across towers
                      tf.get_variable_scope().reuse_variables()

                      # Compute this tower's gradients w.r.t. the loss
                      grads = self.optimizer.compute_gradients(loss)

                      tower_grads.append(grads)
                      scopes.append(scope)
      # Average the gradients across towers
      gvs = self.average_gradients(tower_grads)
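For reference, `average_gradients` here follows the usual tower-gradient reduction (as in the TensorFlow CIFAR-10 multi-GPU example): average each variable's gradient across towers. A minimal NumPy sketch of that reduction (the function name matches the snippet above, but the data and variable names are illustrative):

```python
import numpy as np

def average_gradients(tower_grads):
    """tower_grads: one list of (grad, var_name) pairs per tower,
    with every tower listing the variables in the same order.
    Returns a single list of (averaged_grad, var_name) pairs."""
    averaged = []
    # zip(*tower_grads) groups the i-th (grad, var) pair of every tower.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        var_name = grads_and_vars[0][1]
        averaged.append((np.mean(grads, axis=0), var_name))
    return averaged

# Two towers, one weight 'w' and one bias 'b':
tower0 = [(np.array([2.0, 4.0]), 'w'), (np.array([1.0]), 'b')]
tower1 = [(np.array([4.0, 8.0]), 'w'), (np.array([3.0]), 'b')]
gvs = average_gradients([tower0, tower1])
# gvs[0] -> (array([3., 6.]), 'w'); gvs[1] -> (array([2.]), 'b')
```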

It cannot work, because the network class has only one "self.image" placeholder, so an error of

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'tower_0/Placeholder' with dtype float

will be thrown.
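One way around this is to give each tower its own input placeholders instead of sharing the single `self.image`. A rough graph-mode sketch (written against the TF 1.x-style API via `tf.compat.v1`; the placeholder shapes and the commented feed loop are illustrative, not the repo's actual code):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

NUM_GPUS = 2
tower_inputs = []

with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device("/gpu:%d" % i), tf.name_scope("tower_%d" % i):
            # Each tower gets its OWN placeholders; reusing one
            # self.image across towers is what triggers the
            # "must feed a value for placeholder" error.
            image = tf.placeholder(tf.float32, [1, None, None, 3])
            im_info = tf.placeholder(tf.float32, [1, 3])
            gt_boxes = tf.placeholder(tf.float32, [None, 5])
            tower_inputs.append((image, im_info, gt_boxes))
            # ... build this tower's network from these tensors and
            # collect layers['total_loss'] and the tower gradients ...
            tf.get_variable_scope().reuse_variables()

# At train time, feed one minibatch per tower in a single step:
# feed = {}
# for (image, im_info, gt_boxes), blob in zip(tower_inputs, blobs):
#     feed[image] = blob['data']
#     feed[im_info] = blob['im_info']
#     feed[gt_boxes] = blob['gt_boxes']
# sess.run(train_op, feed_dict=feed)
```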

Can you give any advice on how to implement a multi-GPU version of this code?

Many thanks.

thanks for the effort! you will first need to dump a dataset to some tfrecord, tf slim has great support for multi-gpu training. i have been trying to do this for a long time but haven't really got into it yet.

I see.
There seems to be a branch that supports tfrecord here: philokey@3297a46
But we can't have summaries on the validation set if we build the network in this way:

      layers = self.net.create_architecture(sess, 'TRAIN', self.imdb.num_classes,
                                            image=image,
                                            im_info=tf.expand_dims(im_shape[1:], dim=0),
                                            gt_boxes=gt_boxes, tag='default',
                                            anchor_scales=cfg.ANCHOR_SCALES,
                                            anchor_ratios=cfg.ANCHOR_RATIOS)

Can you give some suggestions on how to use tf slim to implement a multi-GPU version, based on this branch? It seems tricky because your network is defined in a class...

It seems that py_func doesn't support multi-GPU yet. I tried to use multiple GPUs via slim but failed.

I think py_func may be a bottleneck, but I am not sure whether it supports multi-GPU.

So has anyone implemented a version that supports multi-GPU?

...so why is py_func a bottleneck? What is the matter?

Are your GPUs the same type?

Wow thanks so much @ppwwyyxx! This looks amazing! closing this.

It seems like the errors are caused by the nms() wrapped in tf.py_func. When I changed it to py_nms, the errors were resolved. However, the runtime increased a lot.
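For context, the pure-Python NMS in question can be written in NumPy alone, which avoids `tf.py_func` entirely but, as noted above, is much slower than the CUDA kernel. A sketch of greedy NMS (`py_nms` here is an illustrative implementation, not necessarily the repo's exact function):

```python
import numpy as np

def py_nms(dets, thresh):
    """Greedy non-maximum suppression in pure NumPy.
    dets: (N, 5) array of [x1, y1, x2, y2, score]."""
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    scores = dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes overlapping box i by more than the threshold
        order = order[1:][iou <= thresh]
    return keep

dets = np.array([[0, 0, 10, 10, 0.9],
                 [1, 1, 10, 10, 0.8],   # heavy overlap with box 0
                 [20, 20, 30, 30, 0.7]])
# py_nms(dets, 0.5) keeps boxes 0 and 2, suppresses box 1
```

The Python loop runs once per surviving box, so on the thousands of proposals the RPN produces it is far slower than batched GPU NMS, which matches the slowdown reported above.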