tlkh/ai-lab

ImportError: Extension horovod.torch has not been built

yuanbw opened this issue · 2 comments

I ran the following command to test Horovod with PyTorch, and this error occurs:
jovyan@560c5fd869da:~$ mpirun -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python pytorch_mnist.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/horovod/torch/__init__.py", line 27, in <module>
__file__, 'mpi_lib_v2')
File "/opt/conda/lib/python3.6/site-packages/horovod/common/util.py", line 50, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "pytorch_mnist.py", line 8, in <module>
import horovod.torch as hvd
File "/opt/conda/lib/python3.6/site-packages/horovod/torch/__init__.py", line 30, in <module>
__file__, 'mpi_lib', '_mpi_lib')
File "/opt/conda/lib/python3.6/site-packages/horovod/common/util.py", line 50, in check_extension
'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[36009,1],0]
Exit code: 1

Yet PyTorch is installed:

jovyan@560c5fd869da:~$ conda list |grep torch
pytorch 1.3.0 py3.6_cuda10.0.130_cudnn7.6.3_0 pytorch
torchtext 0.4.0 pypi_0 pypi
torchvision 0.4.1 py36_cu100 pytorch
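
The listing above shows that PyTorch itself is present, but that alone does not tell you whether Horovod's PyTorch extension was compiled. A minimal sketch of a check, separate from the MPI launch, is below. The helper name `horovod_extension_status` is hypothetical and not part of Horovod. It only attempts the same import that failed in the traceback and reports the message.

```python
import importlib


def horovod_extension_status(framework="torch"):
    """Try to import horovod.<framework> and return (ok, message).

    Hypothetical helper: it relies only on the ImportError that Horovod
    raises when an extension "has not been built", as in the traceback above.
    """
    module_name = f"horovod.{framework}"
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both "horovod is not installed" and "extension not built".
        return False, f"{module_name}: {exc}"
    return True, f"{module_name}: extension imported successfully"


if __name__ == "__main__":
    ok, message = horovod_extension_status("torch")
    print(message)
```

Running this with the same Python interpreter that `mpirun` launches (here `/opt/conda/bin/python`, judging by the traceback paths) should reproduce the import error without involving MPI at all.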

Greetings. Sorry if this is necro bumping, but here are the steps that seem to have helped me.

Although my error was tied to the horovod.tensorflow module, this may offer some insight for the torch and mxnet counterparts.

First, I do not think the issue is whether torch or tensorflow is installed, but whether the corresponding horovod module was actually built.
I was using a conda environment. After running HOROVOD_WITH_TENSORFLOW=1 pip install horovod horovod[tensorflow] and scouring the error log, I found that the conda environment lacked the G++ compiler needed to compile the tensorflow sub-module.

After installing [gxx_linux-64](https://anaconda.org/anaconda/gxx_linux-64) in that conda environment, HOROVOD_WITH_TENSORFLOW=1 pip install horovod horovod[tensorflow] was able to build the missing module.

I guess it should be similar in your case: run HOROVOD_WITH_PYTORCH=1 pip install horovod horovod[torch], look for the actual build errors to find the missing dependencies that make the build fail, and install them before reinstalling horovod itself.
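
The steps above could be sketched in Python roughly as follows. This is only an illustration under assumptions, not a tested fix. The function names are hypothetical. Checking the `CXX` variable assumes that conda compiler packages export it on activation. The pip flags `--no-cache-dir`, `--force-reinstall`, `--no-deps` and `--verbose` are my additions to force a fresh build with a readable log and are not taken from this thread.

```python
import os
import shutil
import sys


def find_cxx_compiler():
    """Return the path of a C++ compiler found on PATH, or None."""
    # Assumption: conda compiler packages export CXX when the env is activated.
    candidates = [os.environ.get("CXX"), "g++", "c++"]
    for name in candidates:
        if name:
            path = shutil.which(name)
            if path:
                return path
    return None


def reinstall_command(framework_var="HOROVOD_WITH_PYTORCH"):
    """Build the environment and pip command for a verbose, uncached rebuild."""
    env = dict(os.environ, **{framework_var: "1"})
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir", "--force-reinstall", "--no-deps", "--verbose",
        "horovod",
    ]
    return env, cmd


if __name__ == "__main__":
    compiler = find_cxx_compiler()
    if compiler is None:
        print("No C++ compiler found; install one first (e.g. gxx_linux-64 in conda).")
    else:
        env, cmd = reinstall_command()
        print("Using compiler:", compiler)
        print("Would run: HOROVOD_WITH_PYTORCH=1", " ".join(cmd))
        # To actually rebuild, run the command with the modified environment:
        # import subprocess; subprocess.run(cmd, env=env, check=True)
```

With `HOROVOD_WITH_PYTORCH=1` set, the install is expected to fail loudly if the PyTorch extension cannot be built, so the verbose log should contain the real compiler or dependency error instead of the later ImportError.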

If this is happening in an AWS SageMaker training job, how would you fix it?