microsoft/DeepSpeedExamples

How to use DeepSpeed for a multi-node, multi-GPU task in a Slurm cluster

dshwei opened this issue · 0 comments

The Slurm batch script is as follows:

#!/bin/bash

#SBATCH --job-name=pretrain_7        # job name
#SBATCH --nodes=2                    # number of nodes
#SBATCH -w server-gpu-[10,15]
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per node!
#SBATCH --cpus-per-task=80           # number of cores per task
#SBATCH --gres=gpu:8                 # number of gpus per node
#SBATCH --gpus-per-task=8

srun --jobid $SLURM_JOBID bash -c '
    deepspeed --master_port 28727 \
    --num_gpus 16 \
    --num_nodes 2 \
    --hostfile hostfile \
    pre_train_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'
or, using torch.distributed.run instead:

#!/bin/bash

#SBATCH --job-name=pretrain_7        # job name
#SBATCH --nodes=2                    # number of nodes
#SBATCH -w server-gpu-[10,15]
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per node!
#SBATCH --cpus-per-task=80           # number of cores per task
#SBATCH --gres=gpu:8                 # number of gpus per node
#SBATCH --gpus-per-task=8
srun --jobid $SLURM_JOBID bash -c '
    python -m torch.distributed.run \
    --nnodes 2 \
    --nproc_per_node 8 \
    --master_addr $MASTER_ADDR \
    --master_port 9001 \
    pre_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'

Both of the above have problems, maybe because the node servers need socket connections or SSH connectivity to each other.
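In case it helps, here is a minimal sketch of how I would expect the torch.distributed.run variant to be wired up under Slurm, assuming MASTER_ADDR is derived from the first node of the allocation, each node passes its own rank, and the nodes can reach that address over a plain TCP socket (no SSH needed for this variant). The port, script name, and data paths are the ones from my scripts above:

#!/bin/bash
#SBATCH --job-name=pretrain_7
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1          # one launcher task per node
#SBATCH --cpus-per-task=80
#SBATCH --gres=gpu:8

# first node in the allocation acts as the rendezvous master
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=9001

# srun starts one bash task per node; each task launches 8 local workers
# and identifies itself with its Slurm node id as the node rank
srun --jobid $SLURM_JOBID bash -c '
    python -m torch.distributed.run \
    --nnodes 2 \
    --nproc_per_node 8 \
    --node_rank $SLURM_NODEID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    pre_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'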