JayYip/m3tl

How to train the BERT model on multiple GPUs with CollectiveAllReduceStrategy

weizhifei12345 opened this issue · 2 comments

Dear author:
I see that you have already implemented multi-GPU training for BERT with MirroredStrategy. Now I want to train on multiple GPUs with CollectiveAllReduceStrategy instead. I set the distribution strategy with tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2) and then start training with train_and_evaluate, but I run into the error: unsupported operand type(s) for +: 'PerReplica' and 'str'. I don't know how to solve it. (I have 2 V100 GPUs.)
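For reference, this is roughly how I wire the strategy in. The model_fn and input_fn below are simplified placeholders for illustration only, not the repo's actual BERT code:

```python
# Rough sketch of the setup (TF 1.x contrib API); model_fn and input_fn
# are simplified placeholders, not the real BERT model or data pipeline.
import tensorflow as tf


def model_fn(features, labels, mode):
    # Placeholder model: a single dense layer standing in for BERT.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(1e-4).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)


def input_fn():
    # Placeholder input pipeline; the real one feeds tokenized BERT examples.
    ds = tf.data.Dataset.from_tensor_slices(({"x": [[0.0, 1.0]] * 32}, [1] * 32))
    return ds.repeat().batch(8)


# Distribute training across the 2 local GPUs with collective all-reduce.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2)
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))
```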

You may need to provide more information about the problem. Which version of TensorFlow are you using?

Thank you for your reply. I have solved the problem: my cluster IP was not set properly.
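For anyone who hits the same error: when used with Estimator, CollectiveAllReduceStrategy reads the cluster addresses from the TF_CONFIG environment variable, and mine pointed at the wrong IP. Below is a rough sketch of that layout; the addresses and ports are placeholders, not my real machines:

```python
# Sketch of the TF_CONFIG cluster spec; the IPs and ports below are
# placeholders and must be replaced with the reachable addresses of each worker.
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # One reachable IP:port per worker machine.
        "worker": ["10.0.0.1:2222", "10.0.0.2:2222"],
    },
    # Each machine sets its own index: 0 on the first worker, 1 on the second.
    "task": {"type": "worker", "index": 0},
})
```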