Step3 hanging for a long time

Question

Jeayea opened this issue 6 months ago · 1 comment

When I run step 3 with the following script, it hangs. Is there any way to fix this?

Run script

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=facebook/opt-2.7b
CRITIC_MODEL_PATH=facebook/opt-350m
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output_opt6.7_lora
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 4 \
   --per_device_training_batch_size 4 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_lora_dim 128 \
   --output_dir $OUTPUT \
    &> $OUTPUT/training_2.7b_lora_my.log

output

[2024-01-02 07:29:30,646] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:32,889] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-02 07:29:32,944] [INFO] [runner.py:571:main] cmd = /opt/conda/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path facebook/opt-2.7b --critic_model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_generation_batch_size 4 --per_device_training_batch_size 4 --generation_batches 1 --ppo_epochs 1 --max_answer_seq_len 256 --max_prompt_seq_len 256 --actor_learning_rate 9.65e-6 --critic_learning_rate 5e-6 --actor_weight_decay 0.1 --critic_weight_decay 0.1 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --actor_gradient_checkpointing --actor_dropout 0.0 --num_warmup_steps 100 --deepspeed --seed 1234 --enable_hybrid_engine --actor_zero_stage 3 --critic_zero_stage 3 --actor_lora_dim 128 --output_dir ./output_opt6.7_lora
[2024-01-02 07:29:34,584] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:36,227] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.14.3
[2024-01-02 07:29:36,227] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-01-02 07:29:36,228] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-01-02 07:29:36,228] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-01-02 07:29:36,228] [INFO] [launch.py:163:main] dist_world_size=8
[2024-01-02 07:29:36,228] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-01-02 07:29:39,012] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,061] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,069] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,094] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,146] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,166] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,226] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-02 07:29:39,234] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-01-02 07:29:41,543] [INFO] [comm.py:637:init_distributed] cdb=None
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2024-01-02 07:29:41,979] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,005] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,088] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,165] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,180] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,184] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-02 07:29:42,184] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-02 07:29:42,187] [INFO] [comm.py:637:init_distributed] cdb=None
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:6883 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.4.0aws
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /usr/local/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1

pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] ofi_init:1304 NCCL WARN NET/OFI Only EFA provider is supported

pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] ofi_init:1355 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/IB : No device found.
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]veth-app1-2:169.255.254.2<0>
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Using network Socket
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 00/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 01/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 02/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 03/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 04/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 05/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 06/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 07/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 08/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 09/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 10/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 11/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 12/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 13/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 14/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 15/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 16/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 17/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 18/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 19/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 20/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 21/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 22/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Channel 23/0 : 0[101c0] -> 1[101d0] via P2P/IPC/read
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Connected all rings
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO Connected all trees
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
pytorch-1-13-gpu-p-ml-p4d-24xlarge-7ee1912e699dabdbbeb27a423adc:6883:8229 [0] NCCL INFO comm 0x55ebcc06e050 rank 0 nranks 8 cudaDev 0 busId 101c0 - Init COMPLETE
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
************************[start] Initializing Actor Model [start] *************************
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
Setting model_config.dropout to 0.0
Setting model_config.attention_dropout to 0.0
Setting model_config.activation_dropout to 0.0
[2024-01-02 07:30:41,488] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 517, num_elems = 2.78B
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...

Environment

torch: 1.13.1
cuda: 11.7
GPU: A100 * 8

ds_report

[2024-01-02 07:40:06,091] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 143.40 GB

Answer 1 · 2024-01-04T09:46:15.000Z

Running rm -rf /root/.cache/torch_extensions/py39_cu117 may help. In my case, after switching to another instance, it no longer hung. FYI.
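
A possible explanation, offered as an assumption and not confirmed in this thread: the log stops right after "Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...", which is the point where JIT-compiled ops are built or loaded. PyTorch's extension loader guards each build directory with a file named lock, and a lock left behind by an interrupted build can make later runs wait indefinitely. The sketch below lists such files, and can delete them, as a narrower alternative to wiping the whole cache. The helper names find_lock_files and remove_lock_files are made up for illustration, and the file name lock is assumed from torch.utils.cpp_extension.

```python
import os
from pathlib import Path


def find_lock_files(cache_root):
    """Return every file named 'lock' below cache_root, sorted by path.

    Assumption: the PyTorch extension loader names its build lock 'lock'.
    """
    root = Path(cache_root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


def remove_lock_files(cache_root):
    """Delete the files found by find_lock_files and return their paths.

    Only call this when no training job is currently building extensions.
    """
    removed = []
    for lock in find_lock_files(cache_root):
        lock.unlink()
        removed.append(lock)
    return removed


if __name__ == "__main__":
    # Default path copied from the log above; TORCH_EXTENSIONS_DIR wins if set.
    default_root = "/root/.cache/torch_extensions/py39_cu117"
    cache_root = os.environ.get("TORCH_EXTENSIONS_DIR", default_root)
    for lock in find_lock_files(cache_root):
        print("possible stale lock:", lock)
```

Run it once with no job active. If it prints any paths, calling remove_lock_files(cache_root) before relaunching is a smaller step than rm -rf on the whole directory, which also forces every op to be recompiled.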