microsoft/DeepSpeed

MPI environment variables are not set

fabiogeraci opened this issue · 2 comments

System Info
HPC cluster, Ubuntu 22.04, 2 nodes x 8 H100 GPUs

LSF as scheduler

[tool.poetry.dependencies]
python = "^3.10"

importlib-metadata = { version = "~=1.0", python = "<3.8" }
tensorboard = "^2.16.2"
sge-data-package = {version = "", source = "sgedata"}
torch = "2.2.1"
torchvision = "0.17.1"
torchaudio = "2.2.1"
transformers = "4.42.0"
datasets = "2.18."
accelerate = "0.28.0"
deepspeed = "0.13.4"
safetensors = "0.4.2"
mpi4py = "^4.0.0"

module load cuda-12.1.1
module load ISG/experimental/fg12/openmpi/5.0.4-cuda12.1-lsf
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

deepspeed \
    --hostfile=${HOSTFILE_PATH} \
    --launcher=OPENMPI \
    --launcher_args="-bind-to none -map-by slot --mca pml ob1 --oversubscribe --display-allocation --display-map" \
    --master_addr=${MASTER_ADDR} \
    --master_port=${_M_PORT} \
    --no_ssh_check \
    src/dna_mlm/runner.py

import os
import typing as tp


def setup_env_ranks() -> tp.Tuple[int, int, int]:
    # Map MPI environment variables to those expected by DeepSpeed/PyTorch
    if 'OMPI_COMM_WORLD_LOCAL_RANK' in os.environ:
        os.environ['LOCAL_RANK'] = os.environ['OMPI_COMM_WORLD_LOCAL_RANK']
        os.environ['RANK'] = os.environ['OMPI_COMM_WORLD_RANK']
        os.environ['WORLD_SIZE'] = os.environ['OMPI_COMM_WORLD_SIZE']
    else:
        raise EnvironmentError(
            "MPI environment variables are not set. "
            "Ensure you are running the script with an MPI-compatible launcher."
        )
    # Return (local_rank, rank, world_size) to match the annotated return type
    return (
        int(os.environ['LOCAL_RANK']),
        int(os.environ['RANK']),
        int(os.environ['WORLD_SIZE']),
    )


setup_env_ranks()

The function should set the environment variables, but instead it raises the error.
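For anyone debugging the same symptom, a small diagnostic like the one below can show which launcher variables actually reach the Python process. This is only a sketch: `find_launcher_vars` is a hypothetical helper name, and the `PMIX_` prefix is an assumption about what an Open MPI 5.x launch may export alongside the `OMPI_` variables.

```python
import os
import typing as tp


def find_launcher_vars(
    env: tp.Mapping[str, str],
    prefixes: tp.Tuple[str, ...] = ("OMPI_", "PMIX_"),  # PMIX_ is an assumption
) -> tp.Dict[str, str]:
    """Return the variables in `env` whose names start with a launcher prefix."""
    return {k: v for k, v in sorted(env.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    found = find_launcher_vars(os.environ)
    if not found:
        # Nothing MPI-related is visible, so the script was probably not
        # started through mpirun (or the launcher did not forward its env).
        print("No OMPI_/PMIX_ variables visible in this process")
    for name, value in found.items():
        print(f"{name}={value}")
```

Running this at the top of the training script on each rank would show whether `OMPI_COMM_WORLD_LOCAL_RANK` is missing entirely or only the check is at fault.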

I found the error: OPENMPI should have been in quotes ("OPENMPI") in the --launcher argument.

deepspeed \
    --hostfile ${HOSTFILE_PATH} \
    --launcher "OPENMPI" \
    --launcher_args "-bind-to none -map-by slot --allow-run-as-root --mca pml ob1 --oversubscribe --display-allocation --display-map" \
    --master_addr ${MASTER_ADDR} \
    --master_port ${_M_PORT} \
    --no_ssh_check \
    src/runner.py

The real question is why I need to set this up myself:

    if 'OMPI_COMM_WORLD_LOCAL_RANK' in os.environ:
        os.environ['LOCAL_RANK'] = os.environ['OMPI_COMM_WORLD_LOCAL_RANK']
        os.environ['RANK'] = os.environ['OMPI_COMM_WORLD_RANK']
        os.environ['WORLD_SIZE'] = os.environ['OMPI_COMM_WORLD_SIZE']
    else:
        raise EnvironmentError(
            "MPI environment variables are not set. "
            "Ensure you are running the script with an MPI-compatible launcher."
        )
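Some background that may explain the need: Open MPI's `mpirun` exports `OMPI_COMM_WORLD_*` variables, while PyTorch's `env://` initialisation reads `RANK`, `WORLD_SIZE` and (by convention) `LOCAL_RANK`, so something has to translate between the two naming schemes. A slightly more defensive version of the mapping above could look like the sketch below. `map_ompi_env` and `_OMPI_TO_TORCH` are hypothetical names, and the choice to keep values that are already set is an assumption about the desired behaviour, not something DeepSpeed prescribes.

```python
import os
import typing as tp

# Open MPI variable name -> name read by PyTorch/DeepSpeed
_OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
}


def map_ompi_env(env: tp.MutableMapping[str, str]) -> tp.Tuple[int, int, int]:
    """Fill in LOCAL_RANK/RANK/WORLD_SIZE from Open MPI variables if needed.

    Values already present in `env` are left untouched, so the function is
    harmless when the launcher has exported the torch-style names itself.
    Returns (local_rank, rank, world_size).
    """
    for src, dst in _OMPI_TO_TORCH.items():
        if src in env:
            env.setdefault(dst, env[src])

    missing = [dst for dst in _OMPI_TO_TORCH.values() if dst not in env]
    if missing:
        raise EnvironmentError(
            "Rank variables are not set: " + ", ".join(missing) + ". "
            "Ensure you are running the script with an MPI-compatible launcher."
        )
    return int(env["LOCAL_RANK"]), int(env["RANK"]), int(env["WORLD_SIZE"])
```

In the training script this would be called as `map_ompi_env(os.environ)` before any distributed initialisation. Taking the mapping as a parameter also makes the function easy to check with a plain dict.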