This repository holds Intel-maintained PyTorch bindings for the Intel® oneAPI Collective Communications Library (oneCCL).

PyTorch is an open-source machine learning framework.

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation and the oneCCL specification.
The `oneccl_bindings_for_pytorch` module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
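In practice, loading the extension is just an import: once `oneccl_bindings_for_pytorch` has been imported, the `ccl` backend name can be passed to `torch.distributed.init_process_group`. A minimal sketch (environment variables such as MASTER_ADDR are assumed to be set, as in the full example further below):

```python
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # importing registers the 'ccl' backend with torch.distributed

# assumes MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE are set in the environment
dist.init_process_group(backend="ccl", init_method="env://")
```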
We recommend Anaconda as the Python package management system. The table below lists the branches (tags) of `oneccl_bindings_for_pytorch` and the PyTorch versions they support.
torch | oneccl_bindings_for_pytorch
---|---
master | master
v1.12.0 | ccl_torch1.12
v1.11.0 | ccl_torch1.11
v1.10.0 | ccl_torch1.10
v1.9.0 | ccl_torch1.9
v1.8.1 | ccl_torch1.8
v1.7.1 | ccl_torch1.7
v1.6.0 | ccl_torch1.6
v1.5-rc3 | beta09
The usage details can be found in the README of the corresponding branch. The following describes the usage of the v1.9 tag; if you want to use another version of torch-ccl, please check out the corresponding branch (tag). For pytorch-1.5.0-rc3, PR #28068 and PR #32361 are needed to dynamically register the external ProcessGroup and to enable the alltoall collective communication primitive. A patch file covering these two PRs is provided in the patches directory and can be applied directly, as sketched below.
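For illustration, applying that patch from inside a pytorch-1.5.0-rc3 source checkout might look like the following; the patch file name is a placeholder here, so check the patches directory for the real name:

```bash
# inside the pytorch-1.5.0-rc3 source tree (patch file name is illustrative)
git apply /path/to/torch-ccl/patches/<patch-file>.patch
```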
- Python 3.6 or later and a C++17 compiler
- PyTorch v1.12.0
- Clone the `oneccl_bindings_for_pytorch` repository:

  ```bash
  git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
  git submodule sync
  git submodule update --init --recursive
  ```

- Install `oneccl_bindings_for_pytorch`:

  ```bash
  python setup.py install
  ```
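After installation, a quick sanity check (not part of the official steps) is to import both packages and print their versions and locations; `cwd` is the install path that the setvars.sh step below also relies on:

```python
import torch
import oneccl_bindings_for_pytorch
from oneccl_bindings_for_pytorch import cwd  # install location used later for env/setvars.sh

print(torch.__version__)  # should match the supported PyTorch version, e.g. 1.12.0
print(cwd)                # directory containing env/setvars.sh
```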
Wheel files are available for the following Python versions:
Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10
---|---|---|---|---|---
1.12.0 | √ | √ | √ | √ |
1.11.0 | √ | √ | √ | √ |
1.10.0 | √ | √ | √ | √ |
```bash
python -m pip install oneccl_bind_pt==1.12.0 -f https://developer.intel.com/ipex-whl-stable
```

Note: the oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel works only with PyTorch 1.12.0; it does not work with PyTorch 1.12.1.
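Because the wheel must match the installed PyTorch version exactly, it can help to check the torch version before installing (a generic check, not specific to this project):

```bash
python -c "import torch; print(torch.__version__)"  # expect 1.12.0 for the 1.12.0 wheel
```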
example.py

```python
import os
import torch
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # registers the 'ccl' backend with torch.distributed

...

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))

...
model = torch.nn.parallel.DistributedDataParallel(model, ...)
...
```
To set up the run environment, source setvars.sh from the oneccl_bindings_for_pytorch installation (the package is installed along with an MPI tool set), then launch with mpirun:

```bash
source <oneccl_bindings_for_pytorch_path>/env/setvars.sh
# e.g.:
# $ oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
# $ source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

```bash
mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
```
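For illustration, `<hostfile>` is a plain text file with one host name per line; with two nodes and two ranks per node, the launch could look like this (host names are placeholders):

```bash
# hostfile contents (illustrative):
#   node1
#   node2
mpirun -n 4 -ppn 2 -f hostfile python example.py
```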
To debug the performance of communication primitives, PyTorch's autograd profiler can be used to inspect the time spent inside oneCCL calls.

Example:
profiling.py

```python
import os
import torch
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))

x = torch.ones([2, 2])
y = torch.ones([4, 4])
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        dist.all_reduce(x)
        dist.all_reduce(y)
dist.barrier()
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```
```bash
mpirun -n 2 -l python profiling.py
```

```
[0] my rank = 0 my size = 2
[0] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[0] Name                                                    Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  # of Calls  Input Shapes
[0] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[0] oneccl_bindings_for_pytorch::allreduce                  91.41%      297.900ms  91.41%       297.900ms  29.790ms      10          [[2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce       8.24%       26.845ms   8.24%        26.845ms   2.684ms       10          [[2, 2], [2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce       0.30%       973.651us  0.30%        973.651us  97.365us      10          [[4, 4], [4, 4]]
[0] oneccl_bindings_for_pytorch::allreduce                  0.06%       190.254us  0.06%        190.254us  19.025us      10          [[4, 4]]
[0] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[0] Self CPU time total: 325.909ms
[0]
[1] my rank = 1 my size = 2
[1] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[1] Name                                                    Self CPU %  Self CPU   CPU total %  CPU total  CPU time avg  # of Calls  Input Shapes
[1] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[1] oneccl_bindings_for_pytorch::allreduce                  96.03%      318.551ms  96.03%       318.551ms  31.855ms      10          [[2, 2]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce       3.62%       12.019ms   3.62%        12.019ms   1.202ms       10          [[2, 2], [2, 2]]
[1] oneccl_bindings_for_pytorch::allreduce                  0.33%       1.082ms    0.33%        1.082ms    108.157us     10          [[4, 4]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce       0.02%       56.505us   0.02%        56.505us   5.651us       10          [[4, 4], [4, 4]]
[1] -----------------------------------------------------  ----------  ---------  -----------  ---------  ------------  ----------  ----------------
[1] Self CPU time total: 331.708ms
[1]
```