gpauloski/kfac-pytorch

Running problems

How did you install K-FAC and PyTorch?

$ git clone https://github.com/gpauloski/kfac-pytorch.git
$ cd kfac-pytorch
$ pip install -e .

What version or commit are you using?

v0.4.1

Describe the problem.

Hi gpauloski, after #54 I ran into another problem that I can't fix. Could you help me? Thank you.

torchrun --standalone --nnodes 1 --nproc_per_node=4 torch_cifar10_resnet.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
[W socket.cpp:558] [c10d] The client socket has failed to connect to [txjgsv10]:55789 (errno: 22 - Invalid argument).
[W socket.cpp:558] [c10d] The client socket has failed to connect to [txjgsv10]:55789 (errno: 22 - Invalid argument).
Collecting env info...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.8.2003 (Core) (x86_64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: 3.9.0 (tags/RELEASE_390/final)
CMake version: version 3.14.0
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.13.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100S-PCIE-32GB
GPU 2: Tesla V100-PCIE-16GB
GPU 3: Tesla V100S-PCIE-32GB

Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.3.0
/usr/local/cuda-9.0/lib64/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] kfac-pytorch==0.4.1
[pip3] numpy==1.21.6
[pip3] torch==1.11.0
[pip3] torchinfo==1.5.2
[pip3] torchvision==0.12.0
[conda] kfac-pytorch 0.4.1 pypi_0 pypi
[conda] numpy 1.21.6 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchinfo 1.5.2 pypi_0 pypi
[conda] torchvision 0.12.0 pypi_0 pypi

Global rank 0 initialized: local_rank = 0, world_size = 4
Global rank 1 initialized: local_rank = 1, world_size = 4
Global rank 2 initialized: local_rank = 2, world_size = 4
Global rank 3 initialized: local_rank = 3, world_size = 4
Traceback (most recent call last):
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1350, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1036, in _send_output
self.send(msg)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 976, in send
self.connect()
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 1443, in connect
super().connect()
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/http/client.py", line 948, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/socket.py", line 707, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/socket.py", line 752, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "torch_cifar10_resnet.py", line 395, in <module>
main()
File "torch_cifar10_resnet.py", line 293, in main
train_sampler, train_loader, _, val_loader = datasets.get_cifar(args)
File "/home/qzy/NGD/kfac-pytorch/examples/cnn_utils/datasets.py", line 52, in get_cifar
transform=transform_train,
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 65, in __init__
self.download()
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/cifar.py", line 141, in download
download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.tgz_md5)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 430, in download_and_extract_archive
download_url(url, download_root, filename, md5)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 131, in download_url
url = _get_redirect_url(url, max_hops=max_redirect_hops)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 79, in _get_redirect_url
with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1393, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60502 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60503 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 60504 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 60501) of binary: /home/qzy/miniconda3/envs/env_fac/bin/python
Traceback (most recent call last):
File "/home/qzy/miniconda3/envs/env_fac/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/qzy/miniconda3/envs/env_fac/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

torch_cifar10_resnet.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-06-13_11:09:10
host : txjgsv10
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60501)
error_file: <N/A>

Hi @Elec-coder, the problem is that your machine cannot reach the URL used to download the CIFAR-10 dataset.

To point you in the right direction, I would look at possible causes of the error on this line:

"/home/qzy/miniconda3/envs/env_fac/lib/python3.7/urllib/request.py", line 1352, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

I would guess it is your firewall, DNS, HTTP proxy, or something of that sort, but I cannot help you further since it is specific to your network configuration and not to K-FAC.
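
If it helps narrow things down, here is a minimal sketch that makes the same `socket.getaddrinfo` call that fails in the traceback, outside of torchrun. The helper name `can_resolve` is made up for illustration, and the hostname is an assumption about where torchvision fetches CIFAR-10 from; substitute whatever host your traceback points to.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the system resolver can turn `host` into an address."""
    try:
        # Same lookup that urllib performs before opening the connection.
        socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror:
        # This is the exception class behind
        # "[Errno -2] Name or service not known".
        return False
    return True


if __name__ == "__main__":
    # Assumption: the CIFAR-10 archive is hosted on this machine name.
    host = "www.cs.toronto.edu"
    print(f"{host} resolves: {can_resolve(host)}")
```

If this prints `False` on the training node but `True` on your laptop, the node most likely has no working DNS or no outbound access, which would be consistent with the error above.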

As a quick fix, you can download the dataset on another machine, copy it over, and point the script at it with the --data-dir command line argument.
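
After copying, a quick sanity check along these lines may save a failed launch. It is only a sketch: `missing_cifar10_files` is a hypothetical helper, and the folder and file names assume the standard extracted layout of the CIFAR-10 Python archive (`cifar-10-batches-py`), which is what torchvision's CIFAR10 class looks for as far as I know.

```python
from pathlib import Path
from typing import List

# Assumption: file names inside the extracted CIFAR-10 "python version" archive.
EXPECTED = ["batches.meta", "test_batch"] + [f"data_batch_{i}" for i in range(1, 6)]


def missing_cifar10_files(data_dir: str) -> List[str]:
    """List expected CIFAR-10 files that are absent under `data_dir`."""
    base = Path(data_dir) / "cifar-10-batches-py"
    return [name for name in EXPECTED if not (base / name).is_file()]


if __name__ == "__main__":
    import sys

    # Pass the same directory you intend to give to --data-dir.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_cifar10_files(data_dir)
    print("all files present" if not missing else f"missing: {missing}")
```

Note that this only checks that the files exist, not that they are intact. If nothing is reported missing, pass the same directory to the training script via --data-dir.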

Thank you!