How to use DeepSpeed for a multi-node, multi-GPU task on a Slurm cluster
dshwei commented
My Slurm job scripts are as follows:
#!/bin/bash
#SBATCH --job-name=pretrain_7 # name
#SBATCH --nodes=2 # nodes
#SBATCH -w server-gpu-[10,15]
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=80 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --gpus-per-task=8
srun --jobid $SLURM_JOBID bash -c 'deepspeed --master_port 28727 \
    --num_gpus 16 \
    --num_nodes 2 \
    --hostfile hostfile \
    pre_train_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'
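One thing I am unsure about is how the hostfile should be produced inside the job. A minimal sketch, assuming 8 GPU slots per node as in the --gres=gpu:8 line above, of generating it from the Slurm allocation before calling deepspeed:

# Build a DeepSpeed hostfile from the current Slurm allocation
# (assumes every node exposes 8 GPUs, matching --gres=gpu:8 above).
scontrol show hostnames "$SLURM_JOB_NODELIST" | awk '{print $1 " slots=8"}' > hostfile
# Expected contents for this allocation:
#   server-gpu-10 slots=8
#   server-gpu-15 slots=8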
or, launching with torch.distributed.run instead:
#!/bin/bash
#SBATCH --job-name=pretrain_7 # name
#SBATCH --nodes=2 # nodes
#SBATCH -w server-gpu-[10,15]
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=80 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --gpus-per-task=8
srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
    --nnodes 2 \
    --nproc_per_node 8 \
    --master_addr $MASTER_ADDR \
    --master_port 9001 \
    pre_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'
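In the second variant I never export MASTER_ADDR and never pass a node rank, which may be part of the problem. A minimal sketch of how both could be derived from Slurm variables before torch.distributed.run is invoked (the port 9001 and the script arguments are simply the ones from above):

# Resolve the rendezvous address to the first node of the allocation.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=9001

# One srun task per node; SLURM_PROCID (0 or 1 here) becomes the node rank.
srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
    --nnodes 2 \
    --nproc_per_node 8 \
    --node_rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    pre_ft_7b_ds.py \
    --model_path="/demo/Llama2-7b-Instruct-hf/" \
    --dataset_name="/demo/train.json" \
    --seq_length 8192 \
    --num_train_epochs 1'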
Both of the above fail, possibly because the nodes need a direct socket connection or passwordless SSH access to each other.
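If it really is an SSH issue, a quick check from inside the allocation (assuming the default pdsh-based DeepSpeed multi-node launcher, which relies on passwordless SSH) would be:

# From server-gpu-10, verify non-interactive SSH to the other allocated node.
ssh -o BatchMode=yes server-gpu-15 hostname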