kamalkraj/ALBERT-TF2.0

Pretraining of albert from scratch is stuck

008karan opened this issue · 8 comments

I am doing pre-training from scratch. Training seems to have started, since the GPUs are being used, but nothing appears on the terminal except this:

***** Number of cores used :  4 
I0227 09:00:31.841020 140137372948224 run_pretraining.py:226] Training using customized training loop TF 2.0 with distrubutedstrategy.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:44.563593 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:44.569019 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
I0227 09:00:45.620952 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). 
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:45.625989 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.679141 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:46.684157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.734523 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:47.739573 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.697876 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0227 09:00:57.703157 140137372948224 cross_device_ops.py:427] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:07.835676 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
I0227 09:01:28.672055 140137372948224 cross_device_ops.py:748] batch_all_reduce: 32 all-reduces with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
2020-02-27 09:01:50.162839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

I also tried with smaller text data, but got the same result.
@kamalkraj
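
The repeated "Reduce to ... CPU:0 then broadcast" lines are typically normal MirroredStrategy start-up logging while the distributed step is traced, so the process may still be compiling the first step rather than hung. Either way, a quick, repo-independent sanity check (assuming TF 2.x; on TF 2.0 the call is tf.config.experimental.list_physical_devices) is to confirm the GPUs are actually visible to TensorFlow, since MirroredStrategy falls back to CPU when none are found and the loop then looks stuck:

import tensorflow as tf

# Diagnostic sketch, not part of run_pretraining.py.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Optional: log where every op is placed (very verbose; debugging only).
tf.debugging.set_log_device_placement(True)

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)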

Same problem here. 9 GPUs are available, but no training happens at all.

I have tested on very small data (100 KB), and it showed results at the end of each epoch. I want to see results at every step: on a bigger dataset each epoch takes a long time, so printing at every step is needed. I still haven't figured out how to do it (a sketch of the pattern follows below).
@kamalkraj @josegchen
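
A minimal, self-contained sketch of the usual pattern for printing the loss at every step in a TF 2 custom training loop under MirroredStrategy (the toy model and data are placeholders, not this repo's actual functions):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Toy stand-ins for the real pretraining inputs and model.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 10]), tf.random.normal([64, 1]))).batch(8)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()

def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_step(inputs):
    # On TF 2.0/2.1 this call is strategy.experimental_run_v2(...).
    per_replica = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)

for step, batch in enumerate(dist_dataset):
    loss = distributed_step(batch)
    print(f"step {step}: loss = {loss.numpy():.4f}")  # every step, not per epoch

The key point is that the print lives in the outer Python loop, outside the tf.function, so it executes eagerly on each step.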

python run_pretraining.py \
  --albert_config_file=model_configs/base/config.json \
  --do_train \
  --input_files=albert/* \
  --meta_data_file_path=meta_data \
  --output_dir=model_checkpoint/ \
  --strategy_type=mirror \
  --train_batch_size=8 \
  --num_train_epochs=3

Have you checked GPU usage? In my case, the GPU is being utilized.

Dear Karan,
I would like to know how it went.
Were you able to pre-train using a single GPU?
Please share your experience!
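
For single-GPU pretraining, the same loop can run under OneDeviceStrategy instead of MirroredStrategy; this is generic tf.distribute usage, not necessarily a strategy_type this repo exposes:

import tensorflow as tf

# Pins all variables and computation to one GPU; the per-step loop
# shown above works unchanged with this strategy.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")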