Errors with multiple async send and receive
Hi friends,

I am experimenting with Gloo async `isend` and `irecv` in my work on pipeline parallelism. With torch==1.8.1 on macOS, I get the error `libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer` when I have multiple outstanding send requests and multiple outstanding receive requests. With torch==1.7.1, and on Linux, this example hangs forever.

Here is a minimal example:
```python
import socket

import torch
import torch.distributed as dist


def recv_prev(rank, tag):
    input_tensor = torch.empty(1)
    recv_handle = dist.irecv(tensor=input_tensor, src=rank - 1, tag=tag)
    return input_tensor, recv_handle


def send_next(rank, output_tensor, tag):
    send_handle = dist.isend(tensor=output_tensor, dst=rank + 1, tag=tag)
    return send_handle


def run(rank, size, hostname):
    """
    Simulation of simple async communication
    :param rank:
    :param size:
    :param hostname:
    :return:
    """
    num_ops = 3
    for i in range(num_ops):
        if rank == 0:
            tensor = torch.ones(1) * i
            send_handle = send_next(rank, tensor, tag=i)
            print(f"RANK {rank} send {i}")
    for i in range(num_ops):
        if rank == 1:
            recv, recv_handle = recv_prev(rank, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {recv}")
    dist.barrier()


print("Start init...")
dist.init_process_group('gloo')
print("Init done!")
hostname = socket.gethostname()
run(dist.get_rank(), dist.get_world_size(), hostname)
```
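In case it helps narrow things down, here is a sketch of the variant I would try next: the sender keeps every `isend` handle (and the tensor it refers to) alive and waits on all of them before the barrier. This is only my assumption about the intended usage, not something the docs confirm, and the `run_waiting` name is mine:

```python
import torch
import torch.distributed as dist


def run_waiting(rank, size):
    """Like run() above, but rank 0 retains each isend handle and tensor
    and waits on every send before reaching the barrier."""
    num_ops = 3
    if rank == 0:
        handles = []
        tensors = []  # keep the sent tensors alive until their sends complete
        for i in range(num_ops):
            tensor = torch.ones(1) * i
            tensors.append(tensor)
            handles.append(dist.isend(tensor=tensor, dst=rank + 1, tag=i))
            print(f"RANK {rank} send {i}")
        for handle in handles:
            handle.wait()
    elif rank == 1:
        for i in range(num_ops):
            input_tensor = torch.empty(1)
            recv_handle = dist.irecv(tensor=input_tensor, src=rank - 1, tag=i)
            recv_handle.wait()
            print(f"RANK {rank} receive {input_tensor}")
    dist.barrier()
```

I am not sure whether dropping the send handles without waiting on them is what triggers the "unbound buffer" failure, so confirmation either way would be appreciated.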
I run this example with this command:

```
python -m torch.distributed.launch --nproc_per_node=2 minimal_example.py
```
With this minimal example, the output I get is:
```
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Start init...
Start init...
Init done!
Init done!
RANK 0 send 0
RANK 0 send 1
RANK 0 send 2
RANK 1 receive tensor([0.])
libc++abi.dylib: terminating with uncaught exception of type gloo::EnforceNotMet: [enforce fail at ../third_party/gloo/gloo/transport/uv/pair.cc:333] buf. Cannot lock pointer to unbound buffer
Killing subprocess 71105
Killing subprocess 71106
Traceback (most recent call last):
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/tianyizhang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/Users/tianyizhang/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/Users/tianyizhang/anaconda3/bin/python', '-u', 'simple_async_exp.py', '--local_rank=1']' died with <Signals.SIGABRT: 6>.
```
Can you help me understand the error message, and please let me know if I am using the API incorrectly?
Thank you in advance!