Multi GPU mode is stuck at the beginning
Hi @affjljoo3581, thank you very much for your work.
When I run the demo, it gets stuck. Without --gpus it works fine, but only on my first GPU.
[root@gpu02]:~/kb/src# python -m gpt2 train --train_corpus ../build/corpus.train.txt \
    --eval_corpus ../build/corpus.test.txt \
    --vocab_path ../build/vocab.txt \
    --dims 1024 \
    --batch_train 128 \
    --batch_eval 128 \
    --seq_len 64 \
    --total_steps 3000 \
    --eval_steps 500 \
    --save_steps 3000 \
    --gpus 4
    --save_checkpoint_path ckpt-gpt2.pth \
    --save_model_path gpt2-pretrained.pth
Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]
How can I fix this so that the program keeps going?
First of all, you did not append the backslash (\) to the end of the --gpus 4 parameter line. Because of that, the arguments after the --gpus 4 line may be ignored. I don't think that alone is the solution, but please show me the result after fixing this first.
Sorry, that was just how the command got displayed here. This is the actual format:
python -m gpt2 train --train_corpus ../build/corpus.train.txt \
    --eval_corpus ../build/corpus.test.txt \
    --vocab_path ../build/vocab.txt \
    --dims 1024 \
    --batch_train 128 \
    --batch_eval 128 \
    --seq_len 64 \
    --total_steps 3000 \
    --eval_steps 500 \
    --save_steps 3000 \
    --gpus 4 \
    --save_checkpoint_path ckpt-gpt2.pth \
    --save_model_path gpt2-pretrained.pth
and I pressed ENTER to run it.
It still gets stuck, as follows:
Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]
When I run nvidia-smi, the multi-GPU processes seem to be up, but training is still stuck:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3542447 C /root/anaconda3/bin/python 2315MiB |
| 1 3542448 C /root/anaconda3/bin/python 2320MiB |
| 2 3542449 C /root/anaconda3/bin/python 2315MiB |
| 3 3542450 C /root/anaconda3/bin/python 2320MiB |
+-----------------------------------------------------------------------------+
How long did you wait after it froze? Due to the distributed training environment, it usually takes a few minutes before training starts. In my case, 2x V100s required about 2 to 3 minutes.
It had been running for hours, so I canceled it.
However, a single GPU runs at a speed of about 1.5 it/s.
What about two GPUs? Can you show me the result with 2 and 3 GPUs?
I ran this model on 2x V100s. I think the distributed reduction could be the problem. Can you check whether the GPU memory usage increases with the batch size?
And check whether TCP port 8000 is available as well.
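For example, a quick way to check is to try binding the port yourself. This is only a minimal sketch, not code from this repository, and 127.0.0.1 is an assumption; replace it with whatever address the launcher actually binds.

import socket

def port_is_free(port: int, host: str = '127.0.0.1') -> bool:
    # If binding succeeds, no other process is currently listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False

if __name__ == '__main__':
    print('port 8000 free:', port_is_free(8000))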
I think it is a communication problem between the GPUs.
When I set CUDA_VISIBLE_DEVICES=0,1 and --gpus 2, it doesn't work.
But when I set CUDA_VISIBLE_DEVICES=0,2 and --gpus 2, it works.
Maybe only GPUs 0 and 2 (or 1 and 3) are able to communicate with each other.
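One way to check this guess is to print which GPU pairs report CUDA peer-to-peer access; nvidia-smi topo -m also shows the interconnect topology. The snippet below is a minimal sketch using PyTorch's public torch.cuda API, not code from this repository.

import torch

# Print which GPU pairs report CUDA peer-to-peer (P2P) access.
# NCCL can fall back to other transports when P2P is unavailable,
# but hangs on specific device pairs often line up with the P2P topology.
num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f'GPU {src} -> GPU {dst}: peer access = {ok}')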
I also find that the Volatile GPU-Util is too low; most of the time it is 10% or even 0% (it cycles 0% -> 10% -> 99%).
How can I set this up to work like a DataLoader with num_workers?
In my case, the GPU utilization was over 80% with my 2x V100s. Although my Dataset class does not spawn worker threads for fetching data from the corpus, that actually does not matter for performance on a proper system (sufficient CPUs and RAM) with a suitable vocabulary size. How about testing whether my Dataset loader is the bottleneck? Change the _fetch_one function (lines 28 to 50 in 71ebf91) as below:
def _fetch_one(self) -> Dict[str, List[int]]:
    # Skip reading from the corpus and return dummy sequences, so that
    # training throughput is measured without the data-loading cost.
    indices = [0] * (self.seq_len + 1)
    return {'input': indices[:-1], 'output': indices[1:]}
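As for the num_workers question: my Dataset class fetches from the corpus itself rather than going through a torch.utils.data.DataLoader, so num_workers does not apply to it directly. For illustration only, the usual worker-based prefetching pattern in PyTorch looks roughly like the sketch below; ToyTokenDataset and its dummy contents are hypothetical and not part of this repository.

from typing import Dict

import torch
from torch.utils.data import DataLoader, Dataset


class ToyTokenDataset(Dataset):
    # Hypothetical stand-in for a tokenized corpus: every item is a fixed
    # sequence of token ids, split into model input and shifted target.
    def __init__(self, seq_len: int, size: int = 10000):
        self.seq_len = seq_len
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        indices = torch.zeros(self.seq_len + 1, dtype=torch.long)
        return {'input': indices[:-1], 'output': indices[1:]}


if __name__ == '__main__':
    # num_workers controls how many background processes prefetch batches.
    loader = DataLoader(ToyTokenDataset(seq_len=64), batch_size=128,
                        shuffle=True, num_workers=4, pin_memory=True)
    batch = next(iter(loader))
    print(batch['input'].shape, batch['output'].shape)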
Thank you very much. I think this may be another point I need to learn about.
At the moment my GPUs are busy running, and at the same time I need to understand GPT-2 more deeply.