songlab-cal/tape

Can't start pretraining with --nproc_per_node > 1

FTD007 opened this issue · 2 comments

torch version: 1.2
CUDA version: 10.0
Command used:
tape-train-distributed transformer masked_language_modeling --batch_size 512 --learning_rate 1e-3 --fp16 --warmup_steps 10 --nproc_per_node 4 --gradient_accumulation_steps 32

I am able to run with --nproc_per_node set to 1. If it is set to anything larger than 1, the process appears to idle forever with no log output at all. Below is the output when I hit Ctrl-C:

^CTraceback (most recent call last):
  File "/dartfs-hpc/rc/home/w/f00355w/.conda/envs/tape/bin/tape-train-distributed", line 8, in <module>
    sys.exit(run_train_distributed())
  File "/dartfs-hpc/rc/lab/C/CBKlab/Bdai/bertprotein/tape/main.py", line 252, in run_train_distributed
    args.node_rank, args.master_addr, args.master_port)
  File "/dartfs-hpc/rc/lab/C/CBKlab/Bdai/bertprotein/tape/utils/distributed_utils.py", line 169, in launch_process_group
    while not process_context.join():
  File "/dartfs-hpc/rc/lab/C/CBKlab/Bdai/bertprotein/tape/utils/distributed_utils.py", line 82, in join
    timeout=timeout,
  File "/dartfs-hpc/rc/home/w/f00355w/.conda/envs/tape/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/dartfs-hpc/rc/home/w/f00355w/.conda/envs/tape/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/dartfs-hpc/rc/home/w/f00355w/.conda/envs/tape/lib/python3.6/multiprocessing/popen_fork.py", line 29, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

Below is the output of nvidia-smi when running with --nproc_per_node 1:

|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
| N/A   65C    P0    62W / 149W |   9920MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   25C    P8    34W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   31C    P8    32W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       801      C   ...e/w/f00355w/.conda/envs/tape/bin/python  9907MiB |
+-----------------------------------------------------------------------------+

I thought I should be able to set --nproc_per_node to 4. Do I also need to set anything related to --nnodes?
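
In case it helps with debugging, here is a minimal torch.distributed sanity check, independent of TAPE and only a sketch: the master address/port and world size below are placeholders for a single-node setup. If this also hangs with a world size greater than 1, the problem is in the NCCL/distributed setup on the node rather than in TAPE's launcher.

# Minimal multi-process NCCL sanity check (sketch only, not TAPE code).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder: single-node setup
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # All-reduce a small tensor; every rank should print world_size if NCCL works.
    t = torch.ones(1, device=rank)
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # number of GPUs to test, analogous to --nproc_per_node 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)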

rmrao commented

TL;DR: I'm probably going to deprecate TAPE's training code and recommend that people switch to a proper training framework.

Since TAPE was released, a number of training frameworks for pytorch have come out. I have switched to using pytorch lightning or fairseq, depending on the project; DeepSpeed is also a good choice. TAPE's training code was written before these frameworks existed, and it is difficult to keep maintaining it through continual hardware updates and changes to the way pytorch distributed works. So I'd recommend trying one of these frameworks instead.
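
To give a rough idea of what that switch looks like, here is a minimal pytorch lightning sketch of multi-GPU masked-LM training. This is not TAPE code: the toy model and random "pre-masked" data are placeholders just to make the script runnable, and it assumes a reasonably recent pytorch / pytorch-lightning install.

# Minimal pytorch-lightning DDP sketch (toy model + random data, not TAPE code).
import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ToyMaskedLM(pl.LightningModule):
    def __init__(self, vocab_size=30, hidden=64, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(hidden, vocab_size)

    def training_step(self, batch, batch_idx):
        tokens, targets = batch  # targets = original tokens at masked positions
        logits = self.head(self.encoder(self.embed(tokens)))
        loss = nn.functional.cross_entropy(
            logits.transpose(1, 2), targets, ignore_index=-100
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


if __name__ == "__main__":
    # Random placeholder data so the example runs end to end.
    tokens = torch.randint(0, 30, (1024, 128))
    targets = tokens.clone()
    loader = DataLoader(TensorDataset(tokens, targets), batch_size=32)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,                   # plays the role of --nproc_per_node 4
        strategy="ddp",
        precision=16,                # roughly --fp16
        accumulate_grad_batches=32,  # roughly --gradient_accumulation_steps 32
        max_epochs=1,
    )
    trainer.fit(ToyMaskedLM(), loader)

Lightning spawns the worker processes and sets up DistributedDataParallel itself, which takes over the role that tape-train-distributed's --nproc_per_node launcher plays.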

If you're specifically just trying to train transformer models, I'd strongly recommend either fairseq or DeepSpeed, which have excellent implementations for very fast multi-GPU training.

thx! this is super helpful!