Question about using apex
South-Twilight commented
Hi, when I use apex to train a model on 4 GPUs with batch_size=16 in config.yaml, train.log shows the following:
[train]: 6%|▋ | 157283/2500000 [00:07<5077:11:43, 7.80s/it]
[train]: 6%|▋ | 157283/2500000 [00:07<5062:26:00, 7.78s/it]
[train]: 6%|▋ | 157283/2500000 [00:07<4995:54:26, 7.68s/it]
[train]: 6%|▋ | 157283/2500000 [00:07<5001:31:14, 7.69s/it]
[train]: 6%|▋ | 157284/2500000 [00:09<2837:29:46, 4.36s/it]
[train]: 6%|▋ | 157284/2500000 [00:09<2843:41:43, 4.37s/it]
[train]: 6%|▋ | 157284/2500000 [00:09<2810:08:46, 4.32s/it]
[train]: 6%|▋ | 157284/2500000 [00:09<2812:15:34, 4.32s/it]
[train]: 6%|▋ | 157285/2500000 [00:11<2025:45:21, 3.11s/it]
[train]: 6%|▋ | 157285/2500000 [00:11<2029:06:23, 3.12s/it]
[train]: 6%|▋ | 157285/2500000 [00:11<2010:52:27, 3.09s/it]
[train]: 6%|▋ | 157285/2500000 [00:11<2012:00:48, 3.09s/it]
I'm not sure whether checkpoint-150000steps.pkl means the model was trained with
either 150000*4=600000steps && batch_size=16
or 150000steps && batch_size=16*4=64.
I'm looking forward to your reply.
kan-bayashi commented
150000steps && batch_size=16*4=64
This one.
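To spell out the arithmetic behind this answer, here is a minimal sketch. It assumes each of the 4 GPU processes consumes batch_size=16 samples per step and that the step counter advances once per synchronized update. The function name and its parameters are hypothetical and not part of the repository.

```python
def effective_training_stats(steps, batch_size_per_gpu, num_gpus):
    """Return (global_batch_size, total_samples_seen) for data-parallel training.

    Assumption: every GPU processes `batch_size_per_gpu` samples per step,
    and the step counter is shared across GPUs rather than summed.
    """
    global_batch_size = batch_size_per_gpu * num_gpus
    total_samples = steps * global_batch_size
    return global_batch_size, total_samples


# Values taken from the question: 150000 steps, batch_size=16, 4 GPUs.
global_batch, samples = effective_training_stats(150000, 16, 4)
print(global_batch)  # 64
print(samples)       # 9600000
```

Under these assumptions, checkpoint-150000steps.pkl corresponds to 150000 updates at an effective batch size of 64, not 600000 updates at batch size 16.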
South-Twilight commented
Thanks a lot.