facebookincubator/gloo

Errors with multiple async send and receive


Hi Friends,

I am experimenting with Gloo's async isend and irecv for my work on pipeline parallelism. With torch==1.8.1 on macOS, I get the error libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer when I have multiple outstanding send requests and multiple outstanding receive requests. With torch==1.7.1 on Linux, the same example hangs forever.

Here is a minimal example:

import socket

import torch
import torch.distributed as dist

def recv_prev(rank, tag):
    input_tensor = torch.empty(1)
    recv_handle = dist.irecv(tensor=input_tensor, src=rank-1, tag=tag)
    return input_tensor, recv_handle

def send_next(rank, output_tensor, tag):
    send_handle = dist.isend(tensor=output_tensor, dst=rank+1, tag=tag)
    return send_handle

def run(rank, size, hostname):
    """
    Simulation of simple async communication
    :param rank:
    :param size:
    :param hostname:
    :return:
    """
    num_ops = 3

    for i in range(num_ops):
        if rank == 0:
            tensor = torch.ones(1) * i
            send_handle = send_next(rank, tensor, tag=i)
            print(f"RANK {rank} send {i}")

    for i in range(num_ops):
        if rank == 1:
            recv, recv_handle = recv_prev(rank, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {recv}")

    dist.barrier()


print("Start init...")
dist.init_process_group('gloo')
print("Init done!")
hostname = socket.gethostname()
run(dist.get_rank(), dist.get_world_size(), hostname)

I run this example with this command:
python -m torch.distributed.launch --nproc_per_node=2 minimal_example.py

With this minimal example, the output and error I get are:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Start init...
Start init...
Init done!
Init done!
RANK 0 send 0
RANK 0 send 1
RANK 0 send 2
RANK 1 receive tensor([0.])
libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer
Killing subprocess 71105
Killing subprocess 71106
Traceback (most recent call last):
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/Users/tianyizhang/anaconda3/bin/python', '-u', 'simple_async_exp.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
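
For reference, here is a variant of the same loops that keeps every isend/irecv handle (and the corresponding tensors) alive and waits on all of them before the barrier. This is only a sketch of what I assume the intended usage pattern might be; I am not sure whether keeping and waiting on the handles is actually required, or whether it avoids the crash.

import torch
import torch.distributed as dist

def run_with_waits(rank, size):
    num_ops = 3
    handles = []

    if rank == 0:
        # Keep the output tensors alive until the sends have completed.
        outputs = [torch.ones(1) * i for i in range(num_ops)]
        for i in range(num_ops):
            handles.append(dist.isend(tensor=outputs[i], dst=rank + 1, tag=i))
    elif rank == 1:
        inputs = [torch.empty(1) for _ in range(num_ops)]
        for i in range(num_ops):
            handles.append(dist.irecv(tensor=inputs[i], src=rank - 1, tag=i))

    # Block until every outstanding async op has completed before the barrier.
    for handle in handles:
        handle.wait()

    dist.barrier()
    if rank == 1:
        print(f"RANK {rank} received {inputs}")

dist.init_process_group('gloo')
run_with_waits(dist.get_rank(), dist.get_world_size())

If the behavior is the same regardless of whether the handles are waited on, then I suspect the problem is not just in how I am calling the API.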

Can you help me understand the error message, and please let me know if I am using the API incorrectly?

Thank you in advance!