Advice on multi-GPU support?
RichardKov opened this issue · 10 comments
Hi Ender, thanks for your work!
There have been some requests for multi-GPU support (e.g. #51). I am now trying to write a multi-GPU version based on your code.
However, after looking into the code, it seems that the current structure does not lend itself to multi-GPU training. For example, if I modify train_val.py in this way:
```python
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(2):
        with tf.device("/gpu:" + str(i)):
            with tf.name_scope("tower_" + str(i)) as scope:
                # Build the main computation graph
                layers = self.net.create_architecture(sess, 'TRAIN', self.num_classes, tag='default',
                                                      anchor_scales=cfg.ANCHOR_SCALES,
                                                      anchor_ratios=cfg.ANCHOR_RATIOS)
                # Define the loss
                loss = layers['total_loss']
                losses.append(loss)
                tf.get_variable_scope().reuse_variables()
                grads = self.optimizer.compute_gradients(loss)
                tower_grads.append(grads)
                scopes.append(scope)
# Compute the gradients wrt the loss
gvs = self.average_gradients(tower_grads)
```
This cannot work, because the network class holds only one `self.image` placeholder, so the following error is thrown:
```
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'tower_0/Placeholder' with dtype float
```
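One workaround I can imagine is to build a separate network object per tower, so each tower owns its own placeholders while the weights are shared through `reuse_variables()`. A rough, untested sketch (it assumes the network class can be instantiated once per GPU and that its input placeholder is reachable, here via a hypothetical `_image` attribute):
```python
# rough sketch, untested: one network instance per GPU so each tower
# gets its own input placeholders; weights are shared via variable reuse
nets = [self.net.__class__() for _ in range(2)]  # hypothetical fresh instances
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i, net in enumerate(nets):
        with tf.device("/gpu:" + str(i)):
            with tf.name_scope("tower_" + str(i)):
                layers = net.create_architecture(sess, 'TRAIN', self.num_classes,
                                                 tag='tower_' + str(i),
                                                 anchor_scales=cfg.ANCHOR_SCALES,
                                                 anchor_ratios=cfg.ANCHOR_RATIOS)
                tf.get_variable_scope().reuse_variables()
                tower_grads.append(self.optimizer.compute_gradients(layers['total_loss']))
gvs = self.average_gradients(tower_grads)
# at each step, every tower's placeholders would have to be fed, e.g.
# feed_dict = {net._image: blob['data'] for net, blob in zip(nets, blobs)}
```
But I am not sure this is the right direction.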
Can you give any advice on how to implement a multi-GPU version of this code?
Many thanks.
thanks for the effort! you will first need to dump the dataset to some tfrecord; tf slim has great support for multi-gpu training. i have been meaning to do this for a long time but haven't really gotten to it yet.
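the dumping step would look roughly like this (just a sketch; the feature names and the `roidb` fields here are made up for illustration):
```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# write one Example per image; 'image' and 'boxes' are hypothetical
# roidb fields standing in for the image path and the ground-truth boxes
writer = tf.python_io.TFRecordWriter('voc_train.tfrecord')
for entry in roidb:
    with open(entry['image'], 'rb') as f:
        encoded = f.read()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': _bytes_feature(encoded),
        'image/gt_boxes': _bytes_feature(entry['boxes'].astype('float32').tobytes()),
    }))
    writer.write(example.SerializeToString())
writer.close()
```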
I see.
There seems to be a branch that supports tfrecord here: philokey@3297a46
But we can't get summaries on the validation set if we build the network this way:
```python
layers = self.net.create_architecture(sess, 'TRAIN', self.imdb.num_classes,
                                      image=image,
                                      im_info=tf.expand_dims(im_shape[1:], dim=0),
                                      gt_boxes=gt_boxes, tag='default',
                                      anchor_scales=cfg.ANCHOR_SCALES,
                                      anchor_ratios=cfg.ANCHOR_RATIOS)
```
Can you give some suggestions on how to use tf slim to implement a multi-GPU version based on this branch? It seems tricky because your network is defined in a class...
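For what it's worth, the direction I imagine is the usual tower pattern: once the inputs are tensors from the tfrecord pipeline rather than placeholders, each tower can dequeue its own batch. An untested sketch (`batch_queue` stands in for whatever reader/queue the branch provides, and it still assumes `create_architecture` can be called once per tower):
```python
# untested sketch: one dequeue per tower, weights shared via variable reuse
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            image, im_shape, gt_boxes = batch_queue.dequeue()
            layers = self.net.create_architecture(sess, 'TRAIN', self.imdb.num_classes,
                                                  image=image,
                                                  im_info=tf.expand_dims(im_shape[1:], dim=0),
                                                  gt_boxes=gt_boxes,
                                                  tag='tower_%d' % i,
                                                  anchor_scales=cfg.ANCHOR_SCALES,
                                                  anchor_ratios=cfg.ANCHOR_RATIOS)
            tf.get_variable_scope().reuse_variables()
            tower_grads.append(self.optimizer.compute_gradients(layers['total_loss']))
train_op = self.optimizer.apply_gradients(self.average_gradients(tower_grads))
```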
It seems that py_func doesn't support multiple GPUs yet. I tried to use multiple GPUs with slim but failed.
I think py_func may be the bottleneck, but I am not sure whether it supports multiple GPUs.
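My (unverified) understanding is that tf.py_func always runs its Python body on the CPU of the client process, under the GIL, so the towers would serialize on it even if everything else sits on the GPUs. Pinning it to the CPU at least avoids device-placement errors when the surrounding tower is on /gpu:i; a hypothetical sketch:
```python
import numpy as np
import tensorflow as tf
from model.nms_wrapper import nms  # the repo's NMS wrapper (path may differ)

# hypothetical sketch: keep the Python-side NMS on the CPU so placing the
# tower on /gpu:i does not try (and fail) to place the py_func op there;
# 'dets' is the usual [N, 5] float tensor of (x1, y1, x2, y2, score)
with tf.device('/cpu:0'):
    keep = tf.py_func(lambda d: np.asarray(nms(d, 0.7), dtype=np.int32),
                      [dets], tf.int32, name='nms_keep')
```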
So has anyone implemented a version that supports multiple GPUs?
...so why is py_func a bottleneck? what is the matter?
Are your GPUs the same type?
I recently wrote one with multi-gpu support.
https://github.com/ppwwyyxx/tensorpack/tree/master/examples/FasterRCNN
Wow thanks so much @ppwwyyxx! This looks amazing! closing this.
It seems like the errors are caused by the nms() used in tf.py_func. When I changed it to py_nms, the errors went away. However, the running time increased a lot.
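A possible way around this, avoiding py_func entirely, is TF's built-in NMS op, so the whole step stays inside the graph on each tower. A sketch (untested; the op expects boxes in [y1, x1, y2, x2] order, and the config names below are just the usual Faster R-CNN values):
```python
# sketch: graph-native NMS instead of the py_func wrapper; 'boxes' is a
# float32 [N, 4] tensor in (y1, x1, y2, x2) order, 'scores' is float32 [N]
keep = tf.image.non_max_suppression(boxes, scores,
                                    max_output_size=cfg.TRAIN.RPN_POST_NMS_TOP_N,
                                    iou_threshold=cfg.TRAIN.RPN_NMS_THRESH)
proposals = tf.gather(boxes, keep)
roi_scores = tf.gather(scores, keep)
```
I haven't benchmarked it against the CUDA kernel, though.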