Iain-S/torch-ccl-segfault

MRE for CCL bindings for PyTorch crash

Python, MIT license

torch-ccl-segfault

Steps

Install PyTorch, the Intel Extension for PyTorch (IPEX), and the oneCCL bindings for PyTorch with:

```
python -m pip install \
    torch==2.1.0a0 \
    torchvision==0.16.0a0 \
    torchaudio==2.1.0a0 \
    intel-extension-for-pytorch==2.1.10+xpu \
    oneccl-bind-pt==2.1.100+xpu \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

The version numbers are those specified in the IPEX docs.

Use MPI and the run_allgather.sh script to launch allgather.py:
```
mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu
```
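For readers without the repository at hand, the following is a minimal sketch of what an all-gather test script of this kind could look like. It is not the repository's `allgather.py`. The argument order (backend, element count, device), the float32 dtype, the reliance on the `PMI_RANK` and `PMI_SIZE` environment variables set by Hydra-based MPI launchers, and the rendezvous address and port are all assumptions made for illustration.

```python
import argparse
import os
import sys


def parse_args(argv=None):
    # Hypothetical CLI: <backend> <numel> <device>, mirroring the launch command above.
    parser = argparse.ArgumentParser(description="all_gather smoke test (sketch)")
    parser.add_argument("backend")          # e.g. "ccl"
    parser.add_argument("numel", type=int)  # int() accepts underscores: "2_000_000"
    parser.add_argument("device")           # "xpu" or "cpu"
    return parser.parse_args(argv)


def rank_and_world_size(env=os.environ):
    # Assumption: the MPI launcher exports PMI_RANK and PMI_SIZE for each process.
    return int(env.get("PMI_RANK", 0)), int(env.get("PMI_SIZE", 1))


def main(argv=None):
    args = parse_args(argv)
    rank, world_size = rank_and_world_size()

    # Heavy imports are kept local so the helpers above work without PyTorch.
    import torch
    import torch.distributed as dist
    if args.device == "xpu":
        import intel_extension_for_pytorch  # noqa: F401  (enables the "xpu" device)
    if args.backend == "ccl":
        import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

    # Assumed single-node rendezvous settings.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend=args.backend, rank=rank, world_size=world_size)

    # Per-rank device selection is left to the launcher wrapper in this sketch.
    device = torch.device(args.device)
    local = torch.full((args.numel,), float(rank), device=device)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # Each gathered slot should hold the rank number of the process that sent it.
    for src, chunk in enumerate(gathered):
        assert torch.all(chunk == src), f"unexpected data from rank {src}"
    print(f"rank {rank}: all_gather of {args.numel} elements OK")
    dist.destroy_process_group()


# Run only when launched with arguments, so importing the file has no side effects.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Under these assumptions, the only thing that changes between the cases described in this README is the `numel` and `device` arguments.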

Note that:

- Running with a small tensor on the XPU, `mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu`, works as expected.
- Running with a slightly bigger tensor on the XPU, `mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 xpu`, crashes.
- Running the larger tensor on the CPU, `mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 cpu`, works.
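
To compare the three cases in one go, a small driver along the following lines could run each command and summarize its exit status. This is a sketch only: the helper names are hypothetical, and how `mpiexec.hydra` reports a crashed rank in its own exit code is an assumption (the helper treats both a negative return code and the shell-style `128 + signal` convention as death by signal).

```python
import shutil
import signal
import subprocess

# The three cases listed above: (backend, element count, device).
CASES = [
    ("ccl", "2_000_000", "xpu"),
    ("ccl", "3_000_000", "xpu"),
    ("ccl", "3_000_000", "cpu"),
]


def build_command(backend, numel, device, nproc=2):
    # Mirrors the launch command used in this README.
    return ["mpiexec.hydra", "-n", str(nproc), "run_allgather.sh", backend, numel, device]


def describe_returncode(code):
    if code == 0:
        return "ok"
    # subprocess reports death by signal as a negative code; shells and some
    # launchers report it as 128 + signal number instead.
    signum = -code if code < 0 else code - 128
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        return f"failed with exit code {code}"


def run_cases(cases=CASES):
    for case in cases:
        result = subprocess.run(build_command(*case))
        print(" ".join(case), "->", describe_returncode(result.returncode))


if __name__ == "__main__":
    if shutil.which("mpiexec.hydra") is None:
        print("mpiexec.hydra not found on PATH; nothing to run")
    else:
        run_cases()
```

Run it from the repository root so that `run_allgather.sh` resolves the same way as in the manual commands.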