affjljoo3581/GPT2

Multi-GPU mode is stuck at the beginning

Closed this issue · 14 comments

hi @affjljoo3581
Thank you very much for your work
When I run the demo, it gets stuck, but without --gpus it works well (only on my first GPU).
[root@gpu02]:~/kb/src# python -m gpt2 train --train_corpus ../build/corpus.train.txt \

                 --eval_corpus            ../build/corpus.test.txt \
                 --vocab_path             ../build/vocab.txt \
                 --dims                   1024 \
                 --batch_train            128 \
                 --batch_eval             128 \
                 --seq_len                64 \
                 --total_steps            3000 \
                 --eval_steps             500 \
                 --save_steps             3000 \
                 --gpus                   4
                 --save_checkpoint_path   ckpt-gpt2.pth \
                 --save_model_path        gpt2-pretrained.pth

Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]
How can I fix this so that training proceeds?

First of all, you did not append a backslash (\) to the end of the --gpus 4 parameter line. Because of that, the arguments after the --gpus 4 line may be ignored. I don't think that alone explains the hang, but please show me the result after fixing it first.

Sorry, that was just how the command was displayed. The actual command is:
python -m gpt2 train --train_corpus ../build/corpus.train.txt \ --eval_corpus ../build/corpus.test.txt \ --vocab_path ../build/vocab.txt \ --dims 1024 \ --batch_train 128 \ --batch_eval 128 \ --seq_len 64 \ --total_steps 3000 \ --eval_steps 500 \ --save_steps 3000 \ --gpus 4 \ --save_checkpoint_path ckpt-gpt2.pth \ --save_model_path gpt2-pretrained.pth

and I pressed ENTER to run it.
It is still stuck, as follows:

Train GPT-2 model: 0%| | 0/3000 [00:00<?, ?it/s]

When I run nvidia-smi, the multi-GPU processes seem to be up, but training is still stuck:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3542447      C   /root/anaconda3/bin/python                  2315MiB |
|    1   3542448      C   /root/anaconda3/bin/python                  2320MiB |
|    2   3542449      C   /root/anaconda3/bin/python                  2315MiB |
|    3   3542450      C   /root/anaconda3/bin/python                  2320MiB |
+-----------------------------------------------------------------------------+

How long did you wait before deciding it was frozen? Because of the distributed training setup, it usually takes a few minutes before training starts. In my case, 2x V100 needed about 2 to 3 minutes.

It had been running for hours, so I canceled it.
However, a single GPU runs at about 1.5 it/s.

What about two GPUs? Can you show me the result with 2 and 3 GPUs?

I've tested --gpus from 2 to 4.
There was no improvement.
Maybe the dataloader doesn't allow multithreading?


I ran this model on 2x V100s. I think the distributed reduction might be the problem. Can you check whether the GPU memory usage increases with the batch size?

And check whether TCP port 8000 is available as well.
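
If it helps, the port can be checked with a few lines of Python. This is only a quick sketch; the host and port below are just the values mentioned above, nothing the training script itself requires:

    import socket

    # Try to bind TCP port 8000 on localhost. If the bind succeeds, no other
    # process is holding the port, so the distributed backend should be able
    # to use it for its rendezvous.
    def port_is_free(port: int = 8000, host: str = '127.0.0.1') -> bool:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
                return True
            except OSError:
                return False

    print('port 8000 free:', port_is_free())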

The GPU memory usage looks OK when I set the batch size to 64,
and port 8000 is available.

I think it is a communication problem between the GPUs.
When I set CUDA_VISIBLE_DEVICES=0,1 and --gpus 2, it doesn't work.
But when I set CUDA_VISIBLE_DEVICES=0,2 and --gpus 2, it works.
Maybe only 0 and 2, or 1 and 3, are able to communicate.
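
One way to confirm which pairs can talk to each other directly is to query CUDA peer-to-peer access. This is only a rough sketch using PyTorch's peer-access query; it reports P2P capability, not every path NCCL might take:

    import torch

    # Print whether each ordered pair of visible GPUs can access the other's
    # memory directly. Pairs without peer access fall back to copies through
    # host memory, which can slow down or even stall collective operations on
    # some driver/topology combinations.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f'GPU {i} -> GPU {j}: peer access', 'yes' if ok else 'no')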

I find that Volatile GPU-Util is too low; most of the time it is 10% or even 0%.
It cycles like 0% -> 10% -> 99%.
How can I make it work like setting num_workers on a DataLoader?
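
For reference, this is roughly what wiring num_workers into a stock PyTorch DataLoader looks like. It is a minimal sketch, and the TokenizedCorpus class below is hypothetical, not the dataset used in this repository:

    from torch.utils.data import DataLoader, Dataset

    # Hypothetical map-style dataset holding pre-tokenized, equal-length
    # sequences. The repository's own corpus reader streams from a file, so
    # this only illustrates where num_workers plugs in.
    class TokenizedCorpus(Dataset):
        def __init__(self, sequences):
            self.sequences = sequences

        def __len__(self):
            return len(self.sequences)

        def __getitem__(self, idx):
            return self.sequences[idx]

    loader = DataLoader(
        TokenizedCorpus([[1, 2, 3], [4, 5, 6]]),
        batch_size=2,
        num_workers=4,    # worker processes prepare batches in parallel
        pin_memory=True,  # faster host-to-GPU copies
        shuffle=True)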

In my case, the GPU utilization was over 80% with my 2x V100s. Although my Dataset class does not spawn worker threads for fetching data from the corpus, that does not really matter for performance on a proper system (sufficient CPU and RAM) with a suitable vocabulary size. How about testing whether my Dataset loader is the bottleneck? Change the

    def _fetch_one(self) -> Dict[str, List[int]]:
        while True:
            # Read subword-tokenized sequence from corpus.
            line = self.corpus_fp.readline()
            if not line:
                # Raise error when all sequences are fetched.
                if not self.repeat:
                    raise StopIteration()
                # Or, move to the first of the corpus.
                self.corpus_fp.seek(0)
                continue
            # Use token indices rather than the token names directly.
            indices = [self.vocab[t] for t in line.split()]
            if len(indices) + 2 > self.seq_len:
                continue
            # Decorate the sequence with additional tokens.
            indices = [self.vocab.bos_idx] + indices + [self.vocab.eos_idx]
            indices += [self.vocab.pad_idx] * (self.seq_len - len(indices) + 1)
            return {'input': indices[:-1], 'output': indices[1:]}

function code to the following:

    def _fetch_one(self) -> Dict[str, List[int]]:
        # Skip the corpus entirely and return a constant dummy sequence, so
        # any remaining slowdown cannot come from data loading.
        indices = [0] * (self.seq_len + 1)
        return {'input': indices[:-1], 'output': indices[1:]}
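
To quantify the difference, a rough timing harness like the one below could be run before and after the change. The names are illustrative; `dataset` stands for whichever object exposes _fetch_one in the training script:

    import time

    # Rough throughput check: call _fetch_one repeatedly and report how many
    # sequences per second the loader produces. Compare the number for the
    # original function with the stubbed-out version above.
    def measure_fetch_rate(dataset, n_samples: int = 10000) -> float:
        start = time.perf_counter()
        for _ in range(n_samples):
            dataset._fetch_one()
        return n_samples / (time.perf_counter() - start)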

Thank you very much.
I think this may be another point I need to learn about.
At the moment my GPUs are busy running.
In the meantime, I need to understand GPT-2 more deeply.