MRE for CCL bindings for PyTorch crash.
- Install PyTorch, the Intel Extensions and the CCL Bindings with the command below. The version numbers are those specified by the ipex docs.

      python -m pip install \
          torch==2.1.0a0 \
          torchvision==0.16.0a0 \
          torchaudio==2.1.0a0 \
          intel-extension-for-pytorch==2.1.10+xpu \
          oneccl-bind-pt==2.1.100+xpu \
          --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

- Use MPI and the run_allgather.sh script to launch allgather.py:

      mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu

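For readers without the repository at hand, the following is a minimal sketch of what an allgather.py-style script could look like. It is not the original script. The function names, the meaning of the size argument (elements per rank), the float32 tensor contents, the fallback to Hydra's PMI_RANK and PMI_SIZE variables, and the default MASTER_ADDR and MASTER_PORT values are all assumptions made for illustration.

```python
"""Hypothetical sketch of an allgather.py-style reproducer (not the original script)."""
import argparse
import os
import sys


def parse_args(argv):
    """Parse `<backend> <size> <device>`, e.g. `ccl 2_000_000 xpu`."""
    parser = argparse.ArgumentParser(description="all_gather reproducer sketch")
    parser.add_argument("backend", help="torch.distributed backend, e.g. ccl")
    # int() accepts underscore separators, so "2_000_000" parses to 2000000.
    parser.add_argument("size", type=int, help="elements per rank (assumed meaning)")
    parser.add_argument("device", choices=["cpu", "xpu"])
    return parser.parse_args(argv)


def run(args):
    # Heavy imports are deferred so that argument parsing works without them.
    import torch
    import torch.distributed as dist

    if args.device == "xpu":
        import intel_extension_for_pytorch  # noqa: F401  (provides torch.xpu)
    if args.backend == "ccl":
        import oneccl_bindings_for_pytorch  # noqa: F401  (registers the ccl backend)

    # Assumption: the launcher script exports RANK/WORLD_SIZE; otherwise fall
    # back to the PMI_* variables set by the Hydra process manager.
    rank = int(os.environ.get("RANK", os.environ.get("PMI_RANK", "0")))
    world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("PMI_SIZE", "1")))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumed single-node run
    os.environ.setdefault("MASTER_PORT", "29500")      # assumed free port

    dist.init_process_group(backend=args.backend, rank=rank, world_size=world_size)

    if args.device == "xpu":
        # Assumption: ranks are spread round-robin over the visible XPU devices.
        device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
    else:
        device = torch.device("cpu")

    # Each rank contributes a tensor filled with its own rank number.
    local = torch.full((args.size,), float(rank), device=device)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # After the collective, chunk i should hold the value i on every rank.
    for src, chunk in enumerate(gathered):
        assert bool((chunk == float(src)).all()), f"rank {rank}: bad data from {src}"
    print(f"rank {rank}: all_gather of {args.size} elements on {device} OK")

    dist.destroy_process_group()


if __name__ == "__main__" and len(sys.argv) > 1:
    run(parse_args(sys.argv[1:]))
```

A launcher script in the style of run_allgather.sh would then only need to forward its three arguments to this script for each MPI rank.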
Note that:
- Running with a small tensor on the XPU works as expected:

      mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu

- Running with a slightly bigger tensor on the XPU does not work:

      mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 xpu

- Running the larger tensor on a CPU does work:

      mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 cpu
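The three cases above can be run back to back with a small driver, which makes it easier to compare exit codes between the working and failing configurations. This is only a sketch: the helper names build_command and run_cases are hypothetical, and it assumes that mpiexec.hydra and run_allgather.sh can be invoked exactly as in the commands above. No timeout is set, so a hanging run will block the driver.

```python
"""Hypothetical driver that runs the three reproduction cases in sequence."""
import shutil
import subprocess

LAUNCHER = "mpiexec.hydra"

# (backend, tensor size, device) for the three cases listed above.
CASES = [
    ("ccl", 2_000_000, "xpu"),  # reported to work
    ("ccl", 3_000_000, "xpu"),  # reported not to work
    ("ccl", 3_000_000, "cpu"),  # reported to work
]


def build_command(backend, size, device, ranks=2):
    """Return the argv list for one launch, e.g. `... ccl 3_000_000 xpu`."""
    # The "_" format spec renders 3000000 as "3_000_000", matching the source commands.
    return [LAUNCHER, "-n", str(ranks), "run_allgather.sh", backend, f"{size:_}", device]


def run_cases():
    """Run every case and return a mapping from case to process exit code."""
    results = {}
    for case in CASES:
        completed = subprocess.run(build_command(*case))
        results[case] = completed.returncode
    return results


if __name__ == "__main__":
    if shutil.which(LAUNCHER) is None:
        print(f"{LAUNCHER} not found on PATH; nothing was run")
    else:
        for case, code in run_cases().items():
            print(case, "exit code", code)
```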