Is the main difference between the original BERT and bert-multi-gpu just these lines below?
# In the full script, AllReduceCrossDeviceOps presumably comes from
# tf.contrib.distribute and RunConfig from tf.estimator.
tf.logging.info("Use normal RunConfig")
dist_strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=FLAGS.num_gpu_cores,
    cross_device_ops=AllReduceCrossDeviceOps('nccl', num_packs=FLAGS.num_gpu_cores),
)
log_every_n_steps = 8
run_config = RunConfig(
    train_distribute=dist_strategy,
    eval_distribute=dist_strategy,
    log_step_count_steps=log_every_n_steps,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)
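For comparison, the original BERT scripts build a TPU-oriented RunConfig with no distribution strategy at all, roughly like this (a simplified sketch of run_classifier.py in google-research/bert, not an exact copy):

# Simplified sketch of the original BERT RunConfig; no distribution strategy is set.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))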
How can I change the original BERT for multi-GPU fine-tuning? Thank you!
You should use tf.contrib.distribute.MirroredStrategy and implement AdamWeightDecayOptimizer yourself, because the original code implemented by Google does not support a distribution strategy. If you are using bert-multi-gpu, you only need to pass --use_gpu=true and --num_gpu_cores <GPUs> to the entry script to enable multi-GPU support.
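For reference, a minimal TF 1.x sketch of how those flags typically select the strategy; the flag names follow this thread, and the actual entry scripts may differ in detail:

import tensorflow as tf

# FLAGS.use_gpu / FLAGS.num_gpu_cores are assumed to be defined by the entry script.
if FLAGS.use_gpu and FLAGS.num_gpu_cores >= 2:
    dist_strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=FLAGS.num_gpu_cores)
else:
    dist_strategy = None  # single-device training, as in the original BERT scripts

run_config = tf.estimator.RunConfig(
    train_distribute=dist_strategy,
    eval_distribute=dist_strategy,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)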
Thanks a lot for replying!
When I add these lines at run_seq_labeling.py lines 693-699:
accuracy = tf.metrics.accuracy(label_ids, predictions, output_mask)
loss = tf.metrics.mean(per_example_loss)
return {
    "eval_accuracy": accuracy,
    "eval_loss": loss,
    "precision": tf_metrics.precision(label_ids, predictions, num_labels, positions=positions, average='macro'),
    "recall": tf_metrics.recall(label_ids, predictions, num_labels, positions=positions, average='macro'),
    "f1_score": tf_metrics.f1(label_ids, predictions, num_labels, positions=positions, average='macro'),
}
The code for tf_metrics (multiclass F1 score) is from https://github.com/guillaumegenthial/tf_metrics/blob/master/tf_metrics/__init__.py.
I get the following error:
TypeError: Fetch argument PerReplica:{
0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Tensor 'Mean_2:0' shape=() dtype=float32>
1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Tensor 'replica_1/Mean_2:0' shape=() dtype=float32>} has type <class 'tensorflow.python.distribute.values.PerReplica'>, must be a string or Tensor. (Can not convert a PerReplica into a Tensor or Operation.)
Do you know what's wrong here? Is there a better way to evaluate multiclass scores?
It seems that MirroredStrategy is not compatible with tf_metrics. You can confirm this issue with the author.
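If distributed evaluation is not essential, one possible workaround (a sketch under that assumption, not something verified against this project) is to keep MirroredStrategy for training only and let evaluation run on a single device, so the eval metric tensors never become PerReplica values:

run_config = tf.estimator.RunConfig(
    train_distribute=dist_strategy,
    # eval_distribute deliberately left unset: evaluation falls back to one device,
    # so tf_metrics values are ordinary tensors instead of PerReplica objects.
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps)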
Is there a tutorial I can refer to for the changes that make AdamWeightDecayOptimizer support multiple GPUs? And what if I want to switch to a different optimizer?
You can compare the AdamWeightDecayOptimizer in this project's custom_optimization.py with the one in the official project; the original AdamWeightDecayOptimizer does not implement the distributed part.
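To make that comparison concrete, here is a rough, hypothetical illustration (not the project's actual code; see custom_optimization.py for that) of what "implementing the distributed part" usually means in TF 1.x: instead of overriding apply_gradients wholesale and creating adam_m/adam_v variables inside it, the optimizer implements the per-variable hooks so the strategy-aware apply_gradients of tf.train.Optimizer can replicate and merge the updates.

# Hypothetical sketch only; sparse updates and the LayerNorm/bias
# weight-decay exclusions from BERT are omitted for brevity.
import tensorflow as tf

class SketchAdamWeightDecayOptimizer(tf.train.Optimizer):
    """Adam with weight decay, written against the per-variable hooks so the
    DistributionStrategy-aware base apply_gradients drives the update."""

    def __init__(self, learning_rate, weight_decay_rate=0.01, beta_1=0.9,
                 beta_2=0.999, epsilon=1e-6, name="SketchAdamWeightDecayOptimizer"):
        super(SketchAdamWeightDecayOptimizer, self).__init__(False, name)
        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon

    def _create_slots(self, var_list):
        # Slot variables replace the ad-hoc adam_m/adam_v created inside
        # apply_gradients in the original BERT optimizer.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)
            self._zeros_slot(v, "v", self._name)

    def _resource_apply_dense(self, grad, var):
        m = self.get_slot(var, "m")
        v = self.get_slot(var, "v")
        next_m = self.beta_1 * m + (1.0 - self.beta_1) * grad
        next_v = self.beta_2 * v + (1.0 - self.beta_2) * tf.square(grad)
        update = next_m / (tf.sqrt(next_v) + self.epsilon)
        update += self.weight_decay_rate * var
        next_var = var - self.learning_rate * update
        return tf.group(var.assign(next_var), m.assign(next_m), v.assign(next_v))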
@haoyuhu Hello, your project is great! I have a question: the custom AdamWeightDecayOptimizer calls the native apply_gradients (unlike BERT, which calls its own apply_gradients and therefore does not do global_step + 1), so is the + 1 here unnecessary? (https://github.com/HaoyuHu/bert-multi-gpu/blob/master/custom_optimization.py#L104)
It should still be needed: the implementation in AdamWeightDecayOptimizer is not the native apply_gradients. In addition, the fp16 scenario is taken into account here; when a step does not converge, global_step should not be incremented.
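A minimal sketch of that fp16 guard, using hypothetical names (grads, apply_op, global_step) and not the project's exact code, could look like:

import tensorflow as tf

# `grads` is the list of (possibly None) gradients, `apply_op` the op returned
# by apply_gradients, and `global_step` the training-step variable; all are
# assumed to exist in the surrounding training code.
all_finite = tf.reduce_all(
    [tf.reduce_all(tf.is_finite(g)) for g in grads if g is not None])
new_global_step = tf.cond(all_finite,
                          lambda: global_step + 1,
                          lambda: tf.identity(global_step))
train_op = tf.group(apply_op, global_step.assign(new_global_step))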