torch-ccl-segfault

Minimal reproducible example (MRE) for a crash in the oneCCL bindings for PyTorch.

Primary language: Python. License: MIT.

Steps

  1. Install PyTorch, the Intel Extension for PyTorch (IPEX), and the oneCCL bindings for PyTorch:
    python -m pip install \
    torch==2.1.0a0 \
    torchvision==0.16.0a0 \
    torchaudio==2.1.0a0 \
    intel-extension-for-pytorch==2.1.10+xpu \
    oneccl-bind-pt==2.1.100+xpu \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
    The version numbers are those specified by the IPEX docs.
  2. Use MPI and the run_allgather.sh script to launch allgather.py:
    mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu
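
The repository's allgather.py is not reproduced on this page. As a rough, hypothetical sketch of what such a script could look like, the code below assumes that the three command-line arguments are the backend, the number of tensor elements, and the device, that the MPI launcher exposes `PMI_RANK` and `PMI_SIZE`, and that each rank has its own XPU. None of these details are confirmed by the repository; the function names `parse_args` and `main` are illustrative only.

```python
import os
import sys


def parse_args(argv):
    """Parse [backend, numel, device]; int() accepts underscores such as 2_000_000."""
    backend, numel, device = argv
    return backend, int(numel), device


def main(argv):
    backend, numel, device = parse_args(argv)

    # Heavy imports are deferred so argument parsing can be checked without them.
    import torch
    import torch.distributed as dist

    if device == "xpu":
        import intel_extension_for_pytorch  # noqa: F401  (adds the "xpu" device)
    if backend == "ccl":
        import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)

    # Assumption: rank and world size come from the launcher's PMI variables.
    rank = int(os.environ.get("PMI_RANK", "0"))
    world_size = int(os.environ.get("PMI_SIZE", "1"))
    # Assumption: single-node run, so a local rendezvous address is enough.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Assumption: one XPU per rank.
    dev = torch.device(f"xpu:{rank}") if device == "xpu" else torch.device("cpu")
    local = torch.full((numel,), float(rank), device=dev)
    gathered = [torch.empty_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    print(f"rank {rank}: gathered {len(gathered)} tensors of {numel} elements")
    dist.destroy_process_group()


# Only run when all three arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1:])
```

Under these assumptions, a wrapper such as run_allgather.sh would only need to set up the environment and forward its three arguments to the Python script.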

Note that:

  1. Running with a small tensor on the XPU (mpiexec.hydra -n 2 run_allgather.sh ccl 2_000_000 xpu) works as expected.
  2. Running with a slightly larger tensor on the XPU (mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 xpu) crashes.
  3. Running the same larger tensor on the CPU (mpiexec.hydra -n 2 run_allgather.sh ccl 3_000_000 cpu) works.
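
The three cases above can be run back to back with a small driver. This is a convenience sketch, not part of the repository: the helper names `build_command` and `run_cases` are hypothetical, and the script assumes it is started from the repository root with `mpiexec.hydra` on the `PATH`. It only reports the launcher's exit code for each case and does not interpret the cause of a failure.

```python
import shutil
import subprocess

# (backend, tensor size, device) for the three cases listed above.
CASES = [
    ("ccl", "2_000_000", "xpu"),
    ("ccl", "3_000_000", "xpu"),
    ("ccl", "3_000_000", "cpu"),
]


def build_command(backend, size, device, nprocs=2):
    """Build the mpiexec.hydra command line used in this README."""
    return ["mpiexec.hydra", "-n", str(nprocs), "run_allgather.sh", backend, size, device]


def run_cases(cases=CASES):
    """Run each case in turn and collect the launcher's exit code."""
    results = {}
    for case in cases:
        proc = subprocess.run(build_command(*case))
        results[case] = proc.returncode
    return results


# Skip silently when the launcher is not installed.
if __name__ == "__main__" and shutil.which("mpiexec.hydra"):
    for case, code in run_cases().items():
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(" ".join(case), "->", status)
```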