gpauloski/kfac-pytorch

Running Problems

Closed this issue · 3 comments

How did you install K-FAC and PyTorch?

$ git clone https://github.com/gpauloski/kfac-pytorch.git
$ cd kfac-pytorch
$ pip install -e .

What version of commit are you using?

v0.4.1

Describe the problem.

Hi gpauloski,
I used pip to install "requirements-dev.txt" and "requirements.txt". When I run this project using
"torchrun --standalone --nnodes 1 --nproc_per_node=4 torch_cifar10_resnet.py", I encounter the following problem:
SyntaxError: invalid syntax
/home/NGD/kfac-pytorch/kfac/distributed.py:18: UserWarning: NVIDIA Apex is not installed or was not installed with --cpp_ext. Falling back to PyTorch flatten and unflatten.
'NVIDIA Apex is not installed or was not installed with --cpp_ext. '
Traceback (most recent call last):
File "torch_cifar10_resnet.py", line 11, in <module>
import cnn_utils.engine as engine
File "/home/NGD/kfac-pytorch/examples/cnn_utils/engine.py", line 10, in <module>
import kfac
File "/home/NGD/kfac-pytorch/kfac/__init__.py", line 5, in <module>
import kfac.base_preconditioner as base_preconditioner
File "/home/NGD/kfac-pytorch/kfac/base_preconditioner.py", line 16, in <module>
from kfac.layers.base import KFACBaseLayer
File "<fstring>", line 1
(src=)

How to solve this? Thank you!

Hi there, can you run this PyTorch diagnostic script and paste the output so I can see your training environment?

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Thanks for your prompt reply. The following is the output:

Collecting environment information...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.8.2003 (Core) (x86_64)
GCC version: (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Clang version: 3.9.0 (tags/RELEASE_390/final)
CMake version: version 3.14.0
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1127.13.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100S-PCIE-32GB
GPU 2: Tesla V100-PCIE-16GB
GPU 3: Tesla V100S-PCIE-32GB

Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.3.0
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.3.0
/usr/local/cuda-9.0/lib64/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] kfac-pytorch==0.4.1
[pip3] mypy==0.950
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.6
[pip3] torch==1.11.0
[pip3] torchinfo==1.5.2
[pip3] torchvision==0.12.0
[conda] kfac-pytorch 0.4.1 dev_0
[conda] numpy 1.21.6 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchinfo 1.5.2 pypi_0 pypi
[conda] torchvision 0.12.0 pypi_0 pypi

At first, when I ran pip install -r requirements.txt, an error occurred:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.3.7 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 4.2.0 which is incompatible.

Then I ran "pip install importlib-metadata==4.4", which produced another error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flake8 4.0.1 requires importlib-metadata<4.3; python_version < "3.8", but you have importlib-metadata 4.4.0 which is incompatible.

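Your original issue occurs because the repo uses an f-string feature that Python 3.7 does not support: the = specifier, as in f'{my_var=}', which was added in Python 3.8. On 3.7 the parser rejects it, which is the "SyntaxError: invalid syntax" in your traceback.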
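
To illustrate, here is a minimal sketch contrasting the = specifier with an equivalent form that also parses on Python 3.7. The variable my_var and its value are placeholders for illustration, not code taken from the kfac package.

```python
import sys

my_var = 42  # illustrative value only

# Portable form: parses on every Python version that has f-strings (3.6+).
portable = f'my_var={my_var!r}'

# The '=' specifier only parses on Python 3.8+. It is wrapped in eval() so
# this file still compiles on 3.7, where writing the f-string directly
# would raise "SyntaxError: invalid syntax" as soon as the module is imported.
if sys.version_info >= (3, 8):
    shorthand = eval("f'{my_var=}'")
    assert shorthand == portable

print(portable)
```

Because the error is raised at parse time, a single such f-string anywhere in a module is enough to make the whole import fail on 3.7.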

I've addressed the Python 3.7 compatibility issues in #55 and merged into main.

For the second issue with pip's dependency resolver, the conflict is that flake8 requires an older version of importlib-metadata than what markdown requires (and markdown is required by TensorBoard). This is a known incompatibility.
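
As a rough illustration of why pip cannot satisfy both packages on Python 3.7, the sketch below checks the two versions from the error messages against both constraints. The helper names (parse, markdown_ok, flake8_ok) are made up for this example, and the tuple comparison is a simplification, not pip's actual resolver or full PEP 440 version handling.

```python
def parse(version):
    """Turn a dotted version string like '4.2.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split('.'))


def markdown_ok(version):
    # From the first error: markdown 3.3.7 requires importlib-metadata>=4.4
    return parse(version) >= parse('4.4')


def flake8_ok(version):
    # From the second error: flake8 4.0.1 requires importlib-metadata<4.3
    # (this constraint only applies when python_version < "3.8")
    return parse(version) < parse('4.3')


for candidate in ['4.2.0', '4.4.0']:
    print(candidate,
          'markdown ok:', markdown_ok(candidate),
          'flake8 ok:', flake8_ok(candidate))
```

No version can be both >=4.4 and <4.3, so whichever one is installed, one of the two packages reports a conflict.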

You have a few options:

  • Do not install the requirements-dev.txt dependencies. You only need these dependencies if you plan to contribute to the repo as they are the packages required by the CI workflows (explained here). If you just want to run the examples, you just need PyTorch >=1.8 and the dependencies in examples/requirements.txt.
  • If you want to run the examples and the CI workflows (e.g., tox), then use separate virtualenvs for the examples and for the CI (this is the recommended solution from the flake8 maintainers).
  • Upgrade to Python 3.8 or later to avoid flake8's importlib-metadata version restrictions.
  • Install everything but TensorBoard and disable TensorBoard in the examples to avoid installing markdown.
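
For the separate-environments option, one possible sketch uses Python's standard-library venv module (swapped in here for the virtualenv tool). The function create_envs and the directory names examples-env and ci-env are hypothetical choices for this example, not something the repo prescribes.

```python
import venv
from pathlib import Path


def create_envs(root, names=('examples-env', 'ci-env'), with_pip=True):
    """Create one isolated environment per name under root; return the paths."""
    paths = []
    for name in names:
        path = Path(root) / name
        # with_pip=True bootstraps pip into the new environment.
        venv.create(path, with_pip=with_pip)
        paths.append(path)
    return paths

# Example call (hypothetical location):
# create_envs(Path.home() / 'kfac-envs')
```

After that, you would install examples/requirements.txt using the pip inside the first environment and requirements-dev.txt using the pip inside the second, so the two importlib-metadata constraints never have to be satisfied in the same place.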