aws/aws-ofi-nccl

NCCL internal error after aws-ofi-nccl upgrade to version 1.7.4

ps-stability opened this issue · 6 comments

After upgrading aws-ofi-nccl to version 1.7.4, workloads fail in ncclCommInit during torch.distributed.barrier() with:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, internal error - please report this issue to the NCCL developers, NCCL version 2.18.5
ncclInternalError: Internal check failed.

The problem reproduces on PyTorch 2.1.0, 2.1.2 and the nightly.

The code I run is very simple:

import os

import torch  # pytorch needs to be imported before tensorflow or bad things happen
import torch.backends.cuda
import torch.distributed

def get_rank_0_ip(ip_string):
    ip = ip_string.split(",")[0]
    ip = ip.replace("[", "")
    # HACK: keep only the first num_parts dash-separated components of the node name
    parts = ip.split("-")
    num_parts = 5
    if len(parts) > num_parts:
        ip = "-".join(parts[:num_parts])
    return ip

def _init_slurm():
    world_size = int(os.environ.get("SLURM_NTASKS", 1))
    global_rank = int(os.environ.get("SLURM_PROCID", 0))
    local_size = int(os.environ.get("SLURM_NTASKS_PER_NODE", 1))
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))
    num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", 1))
    os.environ["MASTER_ADDR"] = get_rank_0_ip(os.environ.get("SLURM_JOB_NODELIST"))
    os.environ["MASTER_PORT"] = "29500"
    os.environ["RANK"] = str(global_rank)
    os.environ["LOCAL_RANK"] = str(local_rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_WORLD_SIZE"] = str(local_size)
    return world_size, global_rank, local_size, local_rank, num_nodes


def main():
    world_size, global_rank, local_size, local_rank, num_nodes = _init_slurm()
    
    device = torch.device("cuda", local_rank)
    #model_parallel_size = 16

    torch.cuda.set_device(device)

    torch.distributed.init_process_group(
        backend="nccl", rank=global_rank, world_size=world_size
    )

    print("Number of nodes", num_nodes)
    print("total GPUs", world_size)
    print("Torch device count:", torch.cuda.device_count())
    assert (
        torch.cuda.device_count() * num_nodes == world_size
    ), "You need to use all GPUs on all nodes"

    torch.distributed.barrier()
    print("passed barrier function")

if __name__ == "__main__":
    main()

I'm running this on 32 GPUs (4x p4d.24xlarge instances). What might be the root cause of this issue?

Hi. I was unable to reproduce the problem using your script on 4 P5s I had on hand.

Can you provide your full run command, and the output with the environment variable NCCL_DEBUG=INFO set?

Hi @rauteric! Here's the run script (executed with sbatch):

#!/bin/bash                                                                     

#SBATCH --partition=<partition_name>                                                       
#SBATCH --job-name=test                                              
#SBATCH --nodes=4                                     
#SBATCH --ntasks-per-node=8                                                     
#SBATCH --gpus-per-node=8                                                       
#SBATCH --cpus-per-gpu=10                                                       
#SBATCH --output=output_logs/test_%A_%a.out                          
#SBATCH --account=<account_name>

module load cuda/12.1 # match the cuda in the container image                   

export NCCL_DEBUG=INFO
export TORCH_CPP_LOG_LEVEL=INFO

srun python3 test_torch_distributed.py 

Here's the output with some extra c10d logging:
output_job_test_49665_4294967294.log

Thanks. From your log file, the most immediate problem I see is:

ip-10-0-193-153:1208459:1209143 [0] platform_config_endpoint:478 NCCL WARN NET/OFI GDR disabled on GDR-supported instance type p4d.24xlarge

This is a check we added in plugin version 1.7.4 to prevent using P4/P5 instance types without GDR support, which would result in significantly reduced performance.

Can you please share some information on your software setup: which AMI you are using, and whether you are running this script in a container?
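
If it's not too much trouble, the output of a few quick checks on one of the nodes (and inside the container, if you use one) would also help narrow down why the plugin thinks GDR is unavailable. This is only a sketch; the tools are standard (libfabric's fi_info, lsmod), but install paths and module names can differ between AMIs and driver versions:

# Is the EFA libfabric provider visible?
fi_info -p efa -t FI_EP_RDM | head -n 20
# Is a GPUDirect RDMA peer-memory module loaded? (nvidia_peermem on recent
# drivers, nv_peer_mem on older setups)
lsmod | grep -E 'nvidia_peermem|nv_peer_mem'
# Which copy of the aws-ofi-nccl plugin (libnccl-net.so) would NCCL pick up?
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | xargs -I{} ls {}/libnccl-net.so 2>/dev/null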

I can confirm that this was the root cause. Disabling the check with export OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1 fixed the issue. Was this check intended to raise an NCCL error, or might something else have contributed as well? If the former, I think it would be a good idea to improve how it's handled.

Edit: I'm using a DLAMI, and the script was not run in a container.

Hi, sorry for the late response.

No, this check was not supposed to cause the NCCL error. The check simply makes the plugin fail to initialize, which causes NCCL to fall back to its internal sockets transport. It would have been nicer to fail the execution entirely, but falling back to sockets is NCCL's default behavior. I'm not sure why the sockets transport is causing the error you see.
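
One way to confirm which transport NCCL ended up using is to search the NCCL_DEBUG=INFO output for its "Using network" init message (the exact wording can vary a bit between NCCL versions); for example, against the log attached earlier in this thread:

# "Using network AWS Libfabric" -> the plugin initialized
# "Using network Socket"        -> NCCL fell back to its built-in sockets transport
grep -E "Using network|NET/OFI" output_job_test_49665_4294967294.log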

But more importantly, without GDR support (i.e., when setting export OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1), you will not get good performance, since NCCL needs to make an extra host copy for each transfer. The check exists to prevent users from unknowingly getting poor performance on P4/P5 instance types that support GDR.
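
If you want to quantify the gap, comparing nccl-tests (https://github.com/NVIDIA/nccl-tests) bus bandwidth with and without the override is a rough but useful measurement. The launch line below is only a sketch for a Slurm setup like yours and assumes an MPI-enabled build of nccl-tests; adjust it to your MPI/bootstrap configuration:

# One task per GPU, same geometry as the failing job (4 nodes x 8 GPUs)
srun --nodes=4 --ntasks-per-node=8 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
# Expect noticeably lower bus bandwidth when the non-GDR (extra host copy) path is used.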

As for why GDR isn’t supported: if you are using a DLAMI for GPUs, it should have all the software needed for GDR support. Did you make any modifications to the software installed in the DLAMI (other than upgrading the plugin)? Updating to a more recently released GPU DLAMI may help as well.
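
If it helps, the versions of the pieces involved in the GDR path can be captured with standard tools (nothing here is plugin-specific; adjust if your AMI lays things out differently):

nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1   # NVIDIA driver
fi_info --version                                                         # libfabric
modinfo efa | grep -iE 'version'                                          # EFA kernel module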

Hi @rauteric, I'm getting the same error. Our AMIs are sourced from https://github.com/awslabs/amazon-eks-ami. I'm not sure why GDR might be disabled. Any tips for debugging or tracking down the reason?

Also, I just learned from a colleague that we were able to get OFI working with GDR on the same instance with the same AMI; the only difference is the Docker image the app runs in. What could be missing? I couldn't see anything obviously wrong.