kamalkraj/ALBERT-TF2.0

Pre-training on GPUs seems to be stuck

tomohideshibata opened this issue · 3 comments

I have tried to perform pre-training from scratch on GPUs using the following command:
python run_pretraining.py --albert_config_file=albert_config.json --do_train --input_files=/somewhere/*/tf_examples.*.tfrecord --meta_data_file_path=/somewhere/train_meta_data --output_dir=/somewhere --strategy_type=mirror --train_batch_size=128 --num_train_epochs=2

But it seems to get stuck, as shown in the log below:

...
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1209 00:48:14.076103 139679391237952 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I1209 00:48:24.566839 139679391237952 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I1209 00:48:45.377745 139679391237952 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2019-12-09 00:49:16.104345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

The GPUs are running, but no further output appears.
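
As a quick sanity check (a hypothetical standalone snippet, not part of run_pretraining.py), something like the following can confirm that TensorFlow sees the GPUs and that the --input_files glob actually matches readable TFRecord files, which rules out an empty input pipeline as the cause of the apparent hang:

import glob
import tensorflow as tf

# List the GPUs TensorFlow can see (TF 2.0 API).
print("visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# Same pattern as --input_files; adjust the path for your setup.
files = glob.glob("/somewhere/*/tf_examples.*.tfrecord")
print("matched", len(files), "TFRecord files")

# Read a handful of raw records to make sure the files are not empty or corrupt.
count = 0
for _ in tf.data.TFRecordDataset(files).take(1000):
    count += 1
print("read", count, "record(s) without error")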

The core of the pre-training code is similar to the TensorFlow BERT code below, and I have succeeded in running that BERT pre-training code:
https://github.com/tensorflow/models/tree/master/official/nlp/bert

My environment is as follows:

  • tensorflow-gpu==2.0.0
  • CUDA 10.0

Thanks in advance.

About 20 hours after starting, a log line was output, so it wasn't actually stuck.

When I used the TensorFlow BERT code above, log lines were output frequently. I am not sure why the two behave differently.
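
One plausible explanation (an assumption on my side, based on the official BERT training utilities this repo resembles, not a confirmed reading of its code) is that the training loop only returns to Python, and therefore only logs, once per inner "host loop" of steps, so with a large steps-per-loop value and a big dataset nothing is printed for hours even though training is progressing. A minimal sketch of that pattern:

import tensorflow as tf

@tf.function
def train_steps(iterator, steps):
    # Everything in this loop is compiled into one graph call; nothing is
    # printed on the Python side until the whole inner loop finishes.
    total = tf.constant(0.0)
    for _ in tf.range(steps):
        batch = next(iterator)
        total += tf.reduce_sum(batch)  # stand-in for the real forward/backward pass
    return total

def run(iterator, total_steps, steps_per_loop):
    current_step = 0
    while current_step < total_steps:
        train_steps(iterator, tf.constant(steps_per_loop))
        current_step += steps_per_loop
        print("finished step", current_step)  # the only Python-side output

# Toy usage with a dummy dataset.
ds = tf.data.Dataset.from_tensor_slices(tf.zeros([64, 8])).repeat().batch(8)
run(iter(ds), total_steps=300, steps_per_loop=100)

If run_pretraining.py exposes a steps_per_loop-style flag like the official BERT scripts do (I have not verified this), lowering it should make the log lines appear more often, at some cost in throughput.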

@tomohideshibata I am facing the same issue. Can you suggest what changes I need to make to solve it?

@008karan Hi. As mentioned in my comment above, a log line was output about 20 hours after starting. (I changed nothing.)
I think there is something strange in the pre-training code.
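
For anyone else hitting this, one generic way to see progress without waiting for the outer loop (a sketch of plain TF2 code, not a patch to this repo's files) is to log from inside the compiled step function with tf.print, which executes on every step, unlike a Python print in the outer loop:

import tensorflow as tf

step_counter = tf.Variable(0, dtype=tf.int64, trainable=False)

@tf.function
def train_step(batch):
    # ... the real forward/backward pass and optimizer update would go here ...
    step_counter.assign_add(1)
    # tf.print runs inside the graph, so it fires even when the surrounding
    # Python loop only logs once per inner loop of steps.
    if step_counter % 100 == 0:
        tf.print("reached step", step_counter)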