uzh-rpg/RVT

Error on Multi GPU Training

Opened this issue · 7 comments

mauk95 commented

Hi, I am getting the following error when running multi-GPU training on the Gen4 dataset using the command provided in the README instructions:

wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[2023-07-14 17:56:00,361][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-07-14 17:56:10,371][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 17:56:20,380][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[... the same "Waiting in store based barrier" message repeats about every 10 seconds, identical apart from the timestamp ...]
[2023-07-14 18:22:21,224][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:22:31,230][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:41,239][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:22:51,250][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:01,256][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:11,260][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:21,263][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:31,265][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:41,275][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:23:51,280][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:01,283][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:11,291][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:21,300][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:31,309][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:41,318][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:24:51,328][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:01,334][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:11,340][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:21,346][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-14 18:25:31,349][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:25:41,359][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-14 18:25:51,360][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 138, in main
    trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I ran the job on SLURM on 2 V100-32GB GPUs with --cpus-per-task=6. Please let me know what the issue is, thanks.
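
For anyone reading the log above: each barrier message reports how many processes were expected (world_size) and how many had registered at the barrier so far (worker_count). Here rank 0 waits for a second process that never shows up. A minimal sketch for pulling those counters out of such a line follows; the helper name `missing_workers` is hypothetical and not part of PyTorch or RVT.

```python
import re

# Matches the counters printed by the store-based barrier log message.
BARRIER_RE = re.compile(r"world_size=(\d+), worker_count=(\d+)")


def missing_workers(log_line: str) -> int:
    """Return how many processes had not yet joined the barrier.

    Hypothetical helper for illustration only.
    """
    match = BARRIER_RE.search(log_line)
    if match is None:
        raise ValueError("no store-based barrier counters found in line")
    world_size = int(match.group(1))
    worker_count = int(match.group(2))
    return world_size - worker_count


# One message copied from the log above.
line = (
    "Waiting in store based barrier to initialize process group for rank: 0, "
    "key: store_based_barrier_key:1 "
    "(world_size=2, worker_count=1, timeout=0:30:00)"
)
print(missing_workers(line))
```

A non-zero result that never changes over the full 30-minute timeout suggests the remaining rank was never started or could not reach rank 0, rather than being slow.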

This issue might be setup-related (see link1 and link2).

I suggest going through the debugging steps described in the PyTorch docs.
Please show the output of running your command with

export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO

and another run with

export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
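
Alongside those two runs, it may help to record what the job environment looks like from inside the SLURM allocation. The sketch below is only an illustration: the function name `slurm_ddp_report` is hypothetical, and the idea that the number of SLURM tasks should equal the number of requested GPUs is an assumption based on how PyTorch Lightning usually behaves under SLURM (it typically expects SLURM to start one task per GPU). Whether that is the cause here is not confirmed.

```python
import os
from typing import Mapping, Optional


def slurm_ddp_report(requested_gpus: int,
                     env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect environment variables relevant to a DDP launch under SLURM.

    Hypothetical helper for illustration; it only reads variables and
    does not start or configure anything.
    """
    env = os.environ if env is None else env
    ntasks = env.get("SLURM_NTASKS")
    report = {
        "in_slurm_job": "SLURM_JOB_ID" in env,
        "SLURM_NTASKS": ntasks,
        "SLURM_PROCID": env.get("SLURM_PROCID"),
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        "TORCH_CPP_LOG_LEVEL": env.get("TORCH_CPP_LOG_LEVEL"),
        "TORCH_DISTRIBUTED_DEBUG": env.get("TORCH_DISTRIBUTED_DEBUG"),
    }
    # Assumption: one SLURM task per requested GPU is what the launcher expects.
    # None means the task count is unknown (for example, not running under SLURM).
    report["tasks_match_gpus"] = (
        None if ntasks is None else int(ntasks) == requested_gpus
    )
    return report


if __name__ == "__main__":
    # hardware.gpus=[0,1] in the failing command corresponds to 2 GPUs.
    for key, value in slurm_ddp_report(requested_gpus=2).items():
        print(f"{key}: {value}")
```

Running this at the top of the job script and posting the output next to the debug logs would show whether only a single task was launched for a two-GPU run.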
mauk95 commented

Hi @magehrig, thanks for the reply. I tried the debugging steps from the links you mentioned, but nothing seems to work.

The output with export TORCH_DISTRIBUTED_DEBUG=INFO is as follows:

�[34m�[1mwandb�[39m�[22m: logging graph, to disable use wandb.watch(log_graph=False)Using 16bit native Automatic Mixed Precision (AMP) Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a defaultModelSummarycallback. GPU available: True (cuda), used: True TPU available: False, using: 0 TPU cores IPU available: False, using: 0 IPUs HPU available: False, using: 0 HPUsTrainer(limit_train_batches=1.0)was configured so 100% of the batches per epoch will be used..Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used.. Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2 [2023-07-16 12:52:05,230][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0 [2023-07-16 12:52:15,236][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:52:25,246][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:52:35,254][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:52:45,264][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:52:55,270][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:05,277][torch.distributed.distributed_c10d][INFO] - 
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:15,282][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:25,293][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:35,302][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:45,307][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:53:55,311][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:05,314][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:15,319][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:25,324][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:35,334][torch.distributed.distributed_c10d][INFO] - Waiting in 
store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:45,345][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:54:55,351][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:05,353][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:15,356][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:25,367][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:35,368][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:45,372][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:55:55,377][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:05,382][torch.distributed.distributed_c10d][INFO] - Waiting in store 
based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:15,391][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:25,396][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:35,403][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:45,405][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:56:55,406][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:05,407][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:15,415][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:25,422][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:35,433][torch.distributed.distributed_c10d][INFO] - Waiting in store based 
barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:45,441][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:57:55,451][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:05,459][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:15,462][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:25,467][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:35,477][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:45,483][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:58:55,485][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:05,487][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to 
initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:15,497][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:25,508][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:35,512][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:45,516][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 12:59:55,518][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:05,522][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:15,526][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:25,530][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:35,541][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize 
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:45,550][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:00:55,553][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:05,558][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:15,568][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:25,575][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:35,578][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:45,581][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:01:55,592][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:05,601][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process 
group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:15,603][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:25,606][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:35,611][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:45,621][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:02:55,626][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:05,632][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:15,637][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:25,649][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:35,653][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for 
rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:45,657][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:03:55,665][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:05,666][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:15,668][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:25,670][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:35,675][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:45,685][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:04:55,688][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:05,692][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, 
key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:15,696][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:25,703][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:35,704][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:45,713][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:05:55,717][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:05,723][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:15,725][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:25,728][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:35,733][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:45,739][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:06:55,748][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:05,751][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:15,763][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:25,772][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:35,778][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:45,789][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:07:55,793][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:05,797][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:15,801][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:25,809][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:35,811][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:45,819][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:08:55,823][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:05,830][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:15,832][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:25,839][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:35,850][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:45,854][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:09:55,855][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:05,864][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:15,869][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:25,874][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:35,875][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:45,878][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:10:55,886][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:05,896][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:15,899][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:25,908][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:35,910][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:45,919][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:11:55,929][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:05,934][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:15,938][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:25,945][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:35,950][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:45,955][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:12:55,960][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:05,963][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:15,966][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:25,971][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:35,979][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:45,986][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:13:55,996][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:06,004][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:16,012][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:26,021][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:36,026][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:46,034][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:14:56,037][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:06,045][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:16,049][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:26,059][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:36,069][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:46,075][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:15:56,082][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:06,084][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:16,092][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:26,103][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:36,112][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:46,119][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:16:56,122][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:06,127][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:16,137][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:26,140][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:36,147][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:46,156][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:17:56,163][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:06,170][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:16,175][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:26,186][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:36,192][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:46,197][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:18:56,200][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:06,210][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:16,214][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:26,224][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:36,233][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:46,238][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:19:56,243][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:06,253][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:16,255][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:26,266][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:36,272][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:46,279][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:20:56,289][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:06,296][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:16,300][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:26,308][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00) [2023-07-16 13:21:36,316][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: 
store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 13:21:46,321][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
[2023-07-16 13:21:56,326][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 143, in main
    benchmark=config.reproduce.benchmark,
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:30:00)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

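A note on reading the log above: in the store based barrier, every rank adds itself to the key and then waits until `worker_count` equals `world_size`. Here `worker_count` stays at 1 while `world_size` is 2, which suggests that only rank 0 ever reached the barrier. A small sketch for condensing such a log into a summary follows; the function name `summarize_barrier_log` and the returned keys are made up for illustration, and the regular expression only assumes the message format shown in this issue.

```python
import re
from datetime import datetime

# Matches one "Waiting in store based barrier" message as printed in this issue.
BARRIER_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\]"
    r"\[torch\.distributed\.distributed_c10d\]\[INFO\] - "
    r"Waiting in store based barrier to initialize process group for "
    r"rank: (?P<rank>\d+), key: \S+ "
    r"\(world_size=(?P<world>\d+), worker_count=(?P<count>\d+),"
)


def summarize_barrier_log(text: str) -> dict:
    """Condense repeated barrier messages into a few numbers (hypothetical helper)."""
    # Collapse line breaks so that messages wrapped across lines still match.
    normalized = " ".join(text.split())
    matches = list(BARRIER_RE.finditer(normalized))
    if not matches:
        return {}
    fmt = "%Y-%m-%d %H:%M:%S"
    first = datetime.strptime(matches[0]["ts"], fmt)
    last = datetime.strptime(matches[-1]["ts"], fmt)
    return {
        "rank": int(matches[-1]["rank"]),
        "messages": len(matches),
        "waited_seconds": int((last - first).total_seconds()),
        "world_size": int(matches[-1]["world"]),
        # If this never reaches world_size, some rank did not join the barrier.
        "max_worker_count": max(int(m["count"]) for m in matches),
    }
```

If `max_worker_count` is smaller than `world_size` for the whole wait, the second process probably never started or never connected to the store, so its own output is the next thing to check.
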
Another run with export TORCH_DISTRIBUTED_DEBUG=DETAIL gives the following output:

wandb: logging graph, to disable use wandb.watch(log_graph=False)
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Error executing job with overrides: ['model=rnndet', 'dataset=gen4', 'dataset.path=/netscratch/mukhan/thesis/Data/gen4/', 'wandb.project_name=RVT', 'wandb.group_name=1mpx', '+experiment/gen4=base.yaml', 'hardware.gpus=[0,1]', 'batch_size.train=12', 'batch_size.eval=12', 'hardware.num_workers.train=6', 'hardware.num_workers.eval=2']
Traceback (most recent call last):
  File "/netscratch/mukhan/RVT/train.py", line 143, in main
    benchmark=config.reproduce.benchmark,
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _run
    self.strategy.setup_environment()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 153, in setup_environment
    self.setup_distributed()
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 204, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/lightning_lite/utilities/distributed.py", line 237, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1064, in _new_process_group_helper
    backend_class = _create_process_group_wrapper(
  File "/opt/conda/envs/rvt/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3400, in _create_process_group_wrapper
    helper_pg = ProcessGroupGloo(store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Are you using NCCL or GLOO?
Have you tried both?

Hatins commented

Hi @mauk95
I ran into the same problem when using multi-GPU training (though not in RVT). In my case it was caused by insufficient GPU memory, which in turn affected the communication between the GPUs. Maybe you could decrease the batch size and try again.

mauk95 commented

Are you using NCCL or GLOO? Have you tried both?

@magehrig I am using NCCL. Yes, I tried GLOO as well, but with no success yet.

mauk95 commented

Hi @mauk95 I ran into the same problem when using multi-GPU training (though not in RVT). In my case it was caused by insufficient GPU memory, which in turn affected the communication between the GPUs. Maybe you could decrease the batch size and try again.

Hi @Hatins, I tried your suggestion and even set BATCH_SIZE_PER_GPU=1, but I get the same error. I am not sure what the issue is here.

Sorry @mauk95, but this is really hard to debug since I cannot reproduce this. Have you successfully run other projects in PyTorch DDP mode on the same machine/cluster? If yes, you probably have to break the code down to a minimal working example and add complexity step by step to figure out where it breaks.
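
As a possible starting point for such a minimal example, here is a sketch that is independent of RVT and Lightning. It only initializes a process group on a single node and runs one `all_reduce`. The helper names (`find_free_port`, `rendezvous_env`, `worker`, `run_smoke_test`) and the 60 second timeout are assumptions made for illustration, and the `nccl` path assumes one visible GPU per rank.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def rendezvous_env(port: int, addr: str = "127.0.0.1") -> dict:
    """Environment variables read by the default env:// rendezvous."""
    return {"MASTER_ADDR": addr, "MASTER_PORT": str(port)}


def worker(rank: int, world_size: int, backend: str, env: dict) -> None:
    """Runs in each spawned process: init the group, then do one all_reduce."""
    import datetime

    import torch
    import torch.distributed as dist

    os.environ.update(env)
    use_cuda = backend == "nccl"
    if use_cuda:
        # Assumption: rank i uses GPU i on this node.
        torch.cuda.set_device(rank)
    dist.init_process_group(
        backend,
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=60),
    )
    device = torch.device("cuda", rank) if use_cuda else torch.device("cpu")
    value = torch.full((1,), float(rank + 1), device=device)
    dist.all_reduce(value)  # default reduction is SUM
    expected = world_size * (world_size + 1) / 2
    assert value.item() == expected, (rank, value.item(), expected)
    print(f"rank {rank}: all_reduce ok ({value.item()})")
    dist.destroy_process_group()


def run_smoke_test(world_size: int = 2, backend: str = "gloo") -> None:
    """Spawn world_size local processes and run the worker in each of them."""
    import torch.multiprocessing as mp

    env = rendezvous_env(find_free_port())
    mp.spawn(worker, args=(world_size, backend, env), nprocs=world_size, join=True)
```

Calling `run_smoke_test(2, "gloo")` and then `run_smoke_test(2, "nccl")` from under an `if __name__ == "__main__":` guard should show whether plain PyTorch DDP initialization works on the machine at all. If this already hangs or times out, the problem is likely in the environment rather than in the training code.
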