JayYip/m3tl

How to train the BERT model on multiple GPUs with CollectiveAllReduceStrategy

weizhifei12345 opened this issue · 2 comments

Dear author:
I see that you have already implemented multi-GPU training for BERT with MirroredStrategy. Now I want to train on multiple GPUs with CollectiveAllReduceStrategy instead. I set the distribution strategy with tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2) and then start training with train_and_evaluate, but I run into the error: unsupported operand type(s) for +: 'PerReplica' and 'str'. I don't know how to solve it. (I have 2 V100 GPUs.)
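For reference, this is roughly how I wire the strategy in. The model_fn and input_fn below are simplified placeholders for illustration only, not the repo's actual BERT code:

```python
# Rough sketch of the setup (TF 1.x contrib API); model_fn and input_fn
# are simplified placeholders, not the real BERT model or data pipeline.
import tensorflow as tf


def model_fn(features, labels, mode):
    # Placeholder model: a single dense layer standing in for BERT.
    logits = tf.layers.dense(features["x"], 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(1e-4).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)


def input_fn():
    # Placeholder input pipeline; the real one feeds tokenized BERT examples.
    ds = tf.data.Dataset.from_tensor_slices(({"x": [[0.0, 1.0]] * 32}, [1] * 32))
    return ds.repeat().batch(8)


# Distribute training across the 2 local GPUs with collective all-reduce.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(num_gpus_per_worker=2)
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))
```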

You may need to provide more information about the problem. Which version of TensorFlow are you using?

Thank you for your reply. I have solved the problem: my cluster IP was not set properly.
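For anyone who hits the same error: when used with Estimator, CollectiveAllReduceStrategy reads the cluster addresses from the TF_CONFIG environment variable, and mine pointed at the wrong IP. Below is a rough sketch of that layout; the addresses and ports are placeholders, not my real machines:

```python
# Sketch of the TF_CONFIG cluster spec; the IPs and ports below are
# placeholders and must be replaced with the reachable addresses of each worker.
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # One reachable IP:port per worker machine.
        "worker": ["10.0.0.1:2222", "10.0.0.2:2222"],
    },
    # Each machine sets its own index: 0 on the first worker, 1 on the second.
    "task": {"type": "worker", "index": 0},
})
```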