PattanaikL/GeoMol

OS Error with torch-sparse

finalelement opened this issue · 8 comments

I was trying to run this repository with the QM9 dataset. First I ran into the problem reported in issues #2 and #4.

Based on that, I tried downgrading torch to 1.7.0 and torch-geometric to both 1.6.3 and 1.7.2. However, I was unable to get past the error below. I looked for other solutions to it but could not find many resources apart from this one here.

If the owner of this repository could share a requirements file, I would be able to create an environment where this code can run.

Let me know if more info is needed from my side.

:~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Traceback (most recent call last):
  File "train.py", line 9, in <module>
    from model.model import GeoMol
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
    import torch_geometric as tg
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 19, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/_version_cpu.so: undefined symbol: _ZN3c106detail12infer_schema20make_function_schemaENS_8ArrayRefINS1_11ArgumentDefEEES4_
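
An undefined-symbol error like the one above typically points to a compiled extension (here torch-sparse) that was built against a different torch build than the one installed. A minimal sketch for listing the relevant versions side by side, assuming Python 3.8 or newer (on 3.7 the importlib-metadata backport offers the same functions); the package list and the helper name `report_versions` are illustrative, not part of GeoMol:

```python
from importlib import metadata  # standard library from Python 3.8 onwards

# Distributions whose versions have to be mutually compatible (illustrative list).
PACKAGES = [
    "torch",
    "torch-geometric",
    "torch-sparse",
    "torch-scatter",
    "torch-cluster",
    "torch-spline-conv",
]


def report_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions(PACKAGES).items():
        print(f"{name:20s} {version or 'not installed'}")
```

Running this inside the conda environment shows at a glance whether torch and its companion packages come from the same release family.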

Hi,
Did you try using the package versions shown in https://github.com/PattanaikL/GeoMol/blob/main/count_geomol_failures.ipynb ?

No luck with that either, my friend. What CUDA version is being used at your end? I've been trying with 10.1.

I am receiving a new error:

/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Traceback (most recent call last):
  File "train.py", line 9, in <module>
    from model.model import GeoMol
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
    import torch_geometric as tg
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
    import torch_geometric.nn
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
    from .data_parallel import DataParallel
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
    from torch_geometric.data import Batch
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 15, in <module>
    f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
AttributeError: 'NoneType' object has no attribute 'origin'

Here is also my list of package versions:

~/Code/geo_mol/GeoMol$ pip list
Package                       Version
----------------------------- -------------------
argon2-cffi                   20.1.0
ase                           3.22.1
async-generator               1.10
attrs                         21.2.0
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
bleach                        3.3.0
cached-property               1.5.2
certifi                       2021.10.8
cffi                          1.14.5
charset-normalizer            2.0.12
cycler                        0.10.0
dataclasses                   0.6
decorator                     4.4.2
defusedxml                    0.7.1
entrypoints                   0.3
future                        0.18.2
googledrivedownloader         0.4
h5py                          3.6.0
idna                          3.3
importlib-metadata            4.6.0
ipykernel                     5.5.5
ipython                       7.25.0
ipython-genutils              0.2.0
ipywidgets                    7.6.3
isodate                       0.6.1
jedi                          0.18.0
Jinja2                        3.0.1
joblib                        1.1.0
jsonschema                    3.2.0
jupyter                       1.0.0
jupyter-client                6.1.12
jupyter-console               6.4.0
jupyter-core                  4.7.1
jupyterlab-pygments           0.1.2
jupyterlab-widgets            1.0.0
kiwisolver                    1.3.1
llvmlite                      0.38.0
MarkupSafe                    2.0.1
matplotlib                    3.3.4
matplotlib-inline             0.1.2
mistune                       0.8.4
mkl-fft                       1.3.0
mkl-random                    1.2.1
mkl-service                   2.3.0
nbclient                      0.5.3
nbconvert                     6.1.0
nbformat                      5.1.3
nest-asyncio                  1.5.1
networkx                      2.5.1
notebook                      6.4.0
numba                         0.55.1
numpy                         1.20.2
olefile                       0.46
packaging                     20.9
pandas                        1.2.5
pandocfilters                 1.4.2
parso                         0.8.2
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        8.2.0
pip                           21.1.3
POT                           0.7.0
prometheus-client             0.11.0
prompt-toolkit                3.0.19
ptyprocess                    0.7.0
py3Dmol                       0.9.1
pycparser                     2.20
Pygments                      2.9.0
pyparsing                     2.4.7
pyrsistent                    0.17.3
python-dateutil               2.8.1
python-louvain                0.16
pytz                          2021.1
PyYAML                        5.3.1
pyzmq                         22.1.0
qtconsole                     5.1.1
QtPy                          1.9.0
rdflib                        6.1.1
requests                      2.27.1
scikit-learn                  1.0.2
scipy                         1.6.2
seaborn                       0.11.1
Send2Trash                    1.7.1
setuptools                    52.0.0.post20210125
six                           1.16.0
terminado                     0.10.1
testpath                      0.5.0
threadpoolctl                 3.1.0
torch                         1.7.0
torch-cluster                 1.5.9
torch-geometric               1.6.3
torch-scatter                 2.0.9
torch-sparse                  0.6.9
torch-spline-conv             1.2.1
torchaudio                    0.7.0a0+ac17b64
torchvision                   0.8.0
tornado                       6.1
tqdm                          4.61.1
traitlets                     5.0.5
typing-extensions             3.10.0.0
urllib3                       1.26.9
wcwidth                       0.2.5
webencodings                  0.5.1
wheel                         0.36.2
widgetsnbextension            3.5.1
zipp                          3.4.1
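
The `'NoneType' object has no attribute 'origin'` error in the traceback above comes from torch_sparse looking up a file named `<library>_<suffix>` in its own package directory and getting nothing back, which suggests the compiled library for that suffix is missing from the installed wheel. A minimal sketch of the same kind of lookup, assuming it is done with `importlib.machinery.PathFinder` and that the suffix is either `cpu` or `cuda` (only `cpu` appears in the first traceback; `cuda` and the helper name `locate_library` are assumptions):

```python
import importlib.machinery
import importlib.util
import os.path as osp


def locate_library(package_dir, library, suffix):
    """Return the path of `<library>_<suffix>` inside package_dir, or None."""
    spec = importlib.machinery.PathFinder().find_spec(
        f"{library}_{suffix}", [package_dir]
    )
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # find_spec on a top-level package locates it without importing it.
    pkg = importlib.util.find_spec("torch_sparse")
    if pkg is None or pkg.origin is None:
        print("torch_sparse is not installed in this environment")
    else:
        package_dir = osp.dirname(pkg.origin)
        for suffix in ("cpu", "cuda"):
            print(suffix, locate_library(package_dir, "_version", suffix))
```

If the suffix the package expects prints `None`, reinstalling torch-sparse from a wheel that matches the installed torch and CUDA build is the likely fix.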

Hey, apologies for all the issues you're having. Installing the correct version of torch-geometric and its dependencies has been really difficult. For my environment, I'm using CUDA 10.2. I've generated a requirements.txt file from my environment and attached it here. Also, here's what I get from pip list. Please let us know if any of these options are helpful.


Package                       Version          
----------------------------- -------------------
alembic                       1.6.5
argon2-cffi                   20.1.0
ase                           3.16.2
async-generator               1.10
attrs                         21.2.0
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
bleach                        3.3.0
certifi                       2021.5.30
cffi                          1.14.5
chardet                       4.0.0
click                         8.0.1
cliff                         2.15.0
cmaes                         0.8.2
cmd2                          0.9.22
colorama                      0.4.4
colorlog                      5.0.1
CoolProp                      6.4.1
cycler                        0.10.0
Cython                        0.29.24
decorator                     4.4.2
defusedxml                    0.7.1
entrypoints                   0.3
Flask                         1.1.2
googledrivedownloader         0.4
greenlet                      1.1.0
idna                          2.10
importlib-metadata            4.6.0
ipykernel                     5.5.5
ipython                       7.25.0
ipython-genutils              0.2.0
ipywidgets                    7.6.3
isodate                       0.6.0
itsdangerous                  2.0.1
jedi                          0.18.0
Jinja2                        3.0.1
joblib                        1.0.1
jsonschema                    3.2.0
jupyter                       1.0.0
jupyter-client                6.1.12
jupyter-console               6.4.0
jupyter-core                  4.7.1
jupyterlab-pygments           0.1.2
jupyterlab-widgets            1.0.0
kiwisolver                    1.3.1
Mako                          1.1.4
MarkupSafe                    2.0.1
matplotlib                    3.3.4
matplotlib-inline             0.1.2
mistune                       0.8.4
mkl-fft                       1.3.0
mkl-random                    1.2.1
mkl-service                   2.3.0
mpmath                        1.2.1
nbclient                      0.5.3
nbconvert                     6.1.0
nbformat                      5.1.3
nest-asyncio                  1.5.1
networkx                      2.5.1
notebook                      6.4.0
numpy                         1.20.2
olefile                       0.46
optuna                        2.8.0
packaging                     20.9
pandas                        1.2.5
pandocfilters                 1.4.2
parso                         0.8.2
pbr                           5.6.0
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        8.2.0
pip                           21.1.3
POT                           0.7.0
prettytable                   0.7.2
prometheus-client             0.11.0
prompt-toolkit                3.0.19
ptyprocess                    0.7.0
py-rdl                        0.0.0
py3Dmol                       0.9.1
pycparser                     2.20
PyDAS                         1.0.2
PyDQED                        1.0.1
Pygments                      2.9.0
pyparsing                     2.4.7
pyperclip                     1.8.2
pyrsistent                    0.17.3
python-dateutil               2.8.1
python-editor                 1.0.4
python-louvain                0.15
pytz                          2021.1
PyYAML                        5.3.1
pyzmq                         22.1.0
qtconsole                     5.1.1
QtPy                          1.9.0
quantities                    0.12.5
rdflib                        5.0.0
requests                      2.25.1
scikit-learn                  0.24.2
scipy                         1.6.2
seaborn                       0.11.1
Send2Trash                    1.7.1
setuptools                    52.0.0.post20210125
six                           1.16.0
SQLAlchemy                    1.4.22
stevedore                     2.0.1
sympy                         1.8
terminado                     0.10.1
testpath                      0.5.0
threadpoolctl                 2.1.0
torch                         1.9.0
torch-cluster                 1.5.9
torch-geometric               1.7.2
torch-scatter                 2.0.7
torch-sparse                  0.6.10
torch-spline-conv             1.2.1
torchvision                   0.10.0
tornado                       6.1
tqdm                          4.61.1
traitlets                     5.0.5
typing-extensions             3.10.0.0
urllib3                       1.26.6
wcwidth                       0.2.5
webencodings                  0.5.1
Werkzeug                      1.0.1
wheel                         0.36.2
widgetsnbextension            3.5.1
zipp                          3.4.1

requirements.txt
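
Since the CUDA version came up above, it can help to confirm which CUDA build the installed torch reports before picking wheels for torch-sparse and torch-scatter. A small sketch, assuming torch may or may not be importable; `describe_torch` is a hypothetical helper name:

```python
def describe_torch():
    """Return torch's version and CUDA build info, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch())
```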

@PattanaikL Thanks a ton for sharing the requirements.txt. I'll get back shortly with an update on how it goes with the shared requirements.txt. Also, thanks for sharing the CUDA toolkit version. :)

@PattanaikL I think the requirements.txt was useful. It seems I am getting closer to being able to run this. This is the latest error I am getting; do let me know if you have seen it before.

(geomol_v2) ~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Arguments are...
log_dir: ./test_run
data_dir: /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9
split_path: /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy
trained_local_model: None
restart_dir: None
dataset: qm9
seed: 0
n_epochs: 250
warmup_epochs: 2
batch_size: 16
lr: 0.001
num_workers: 2
optimizer: adam
scheduler: plateau
verbose: False
model_dim: 25
random_vec_dim: 10
random_vec_std: 1
random_alpha: False
n_true_confs: 10
n_model_confs: 10
gnn1_depth: 3
gnn1_n_layers: 2
gnn2_depth: 3
gnn2_n_layers: 2
encoder_n_head: 2
coord_pred_n_layers: 2
d_mlp_n_layers: 1
h_mol_mlp_n_layers: 1
alpha_mlp_n_layers: 2
c_mlp_n_layers: 1
global_transformer: False
loss_type: ot_emd
teacher_force: False
separate_opts: False

Model parameters are:
hyperparams:
  model_dim: 25
  random_vec_dim: 10
  random_vec_std: 1
  global_transformer: False
  n_true_confs: 10
  n_model_confs: 10
  gnn1:
    depth: 3
    n_layers: 2
  gnn2:
    depth: 3
    n_layers: 2
  encoder:
    n_head: 2
  coord_pred:
    n_layers: 2
  d_mlp:
    n_layers: 1
  h_mol_mlp:
    n_layers: 1
  alpha_mlp:
    n_layers: 2
  c_mlp:
    n_layers: 1
  loss_type: ot_emd
  teacher_force: False
  random_alpha: False
num_node_features: 44
num_edge_features: 4


Starting training...
  0%|                                                                                                                                                     | 0/625 [00:00<?, ?it/s]
[13:51:21] Explicit valence for atom # 0 N, 4, is greater than permitted
Traceback (most recent call last):
  File "train.py", line 69, in <module>
    train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/training.py", line 18, in train
    for i, data in tqdm(enumerate(loader), total=len(loader)):
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 193, in __getitem__
    data = self.get(self.indices()[idx])
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/featurization.py", line 74, in get
    data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
    keep = [t.to(device) for t in keep]
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/utils.py", line 122, in <listcomp>
    keep = [t.to(device) for t in keep]
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Could you try uncommenting this line of train.py?

@PattanaikL Thank you for the suggestion. I had to make some more changes along with the proposed one to get it to run. I'll post a detailed summary tomorrow morning, so please don't close the issue yet. I think the shared insight will be helpful for the community as well.

Thanks a ton for helping me out in getting this to run. :)

To resolve the 'Cannot re-initialize CUDA in forked subprocess' error, you must use the 'spawn' start method. That required wrapping the entire training code in train.py in a main() function and calling it with

if __name__ == "__main__":
    main()

This post serves as a resolution as well.
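
The wrapping described above can be sketched as follows. This is not the actual train.py: it uses the standard library `multiprocessing` module (which `torch.multiprocessing` wraps) so that it runs without torch, and `load_item` is a hypothetical stand-in for the per-worker data loading:

```python
import multiprocessing as mp


def load_item(index, queue):
    """Stand-in for worker-side loading.

    It lives at module level so that spawned children can import it.
    """
    queue.put(index * 2)


def main():
    # With 'spawn', each child starts a fresh interpreter and re-imports this
    # module, so everything that launches workers must sit behind the guard.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    workers = [ctx.Process(target=load_item, args=(i, queue)) for i in range(2)]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so the children can exit cleanly.
    results = sorted(queue.get(timeout=60) for _ in workers)
    for worker in workers:
        worker.join()
    print(results)


if __name__ == "__main__":
    main()
```

Without the `if __name__ == "__main__":` guard, every spawned child would re-run the module's top-level code on import and try to start workers of its own, which is why the training code has to move into `main()`.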

Here is also my latest requirements.txt, as I had to resolve some dependencies manually (this is for a conda environment with Python 3.8 and CUDA 10.2):

absl-py==1.0.0
alembic==1.6.5
argon2-cffi==20.1.0
ase==3.16.2
async-generator==1.10
attrs==21.2.0
backcall==0.2.0
backports.functools-lru-cache==1.6.4
bleach==3.3.0
brotlipy==0.7.0
cachetools==5.0.0
certifi==2021.5.30
cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work
chardet==4.0.0
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.0.1
cliff==2.15.0
cmaes==0.8.2
cmd2==0.9.22
colorama==0.4.4
colorlog==5.0.1
CoolProp==6.4.1
cryptography @ file:///tmp/build/80754af9/cryptography_1639400846433/work
cycler==0.10.0
Cython==0.29.24
decorator==4.4.2
defusedxml==0.7.1
dpcpp-cpp-rt==2022.0.2
entrypoints==0.3
Flask==1.1.2
google-auth==2.6.5
google-auth-oauthlib==0.4.6
googledrivedownloader==0.4
greenlet==1.1.0
grpcio==1.44.0
idna @ file:///tmp/build/80754af9/idna_1637925883363/work
importlib-metadata==4.6.0
intel-cmplr-lib-rt==2022.0.2
intel-cmplr-lic-rt==2022.0.2
intel-opencl-rt==2022.0.2
intel-openmp==2022.0.2
ipykernel==5.5.5
ipython==7.25.0
ipython-genutils==0.2.0
ipywidgets==7.6.3
isodate==0.6.0
itsdangerous==2.0.1
jedi==0.18.0
Jinja2==3.0.1
joblib==1.0.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.12
jupyter-console==6.4.0
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
kiwisolver==1.3.1
Mako==1.1.4
Markdown==3.3.6
MarkupSafe==2.0.1
matplotlib==3.3.4
matplotlib-inline==0.1.2
mistune==0.8.4
mkl==2022.0.2
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186064646/work
mkl-service==2.4.0
mpmath==1.2.1
nbclient==0.5.3
nbconvert==6.1.0
nbformat==5.1.3
nest-asyncio==1.5.1
networkx==2.5.1
notebook==6.4.0
numpy==1.21.6
oauthlib==3.2.0
olefile==0.46
optuna==2.8.0
packaging==20.9
pandas==1.2.5
pandocfilters==1.4.2
parso==0.8.2
pbr==5.6.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.0.1
POT==0.7.0
prettytable==0.7.2
prometheus-client==0.11.0
prompt-toolkit==3.0.19
protobuf==3.20.0
ptyprocess==0.7.0
py3Dmol==0.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments==2.9.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing==2.4.7
pyperclip==1.8.2
pyrsistent==0.17.3
PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work
python-dateutil==2.8.1
python-editor==1.0.4
python-louvain==0.15
pytz==2021.1
PyYAML==5.3.1
pyzmq==22.1.0
qtconsole==5.1.1
QtPy==1.9.0
quantities==0.12.5
rdflib==5.0.0
rdkit-pypi==2022.3.1
requests @ file:///opt/conda/conda-bld/requests_1641824580448/work
requests-oauthlib==1.3.1
rsa==4.8
scikit-learn==0.24.2
scipy==1.6.2
seaborn==0.11.1
Send2Trash==1.8.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
SQLAlchemy==1.4.22
stevedore==2.0.1
sympy==1.8
tbb==2021.5.1
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
terminado==0.10.1
testpath==0.5.0
threadpoolctl==2.1.0
torch==1.9.0+cu102
torch-cluster==1.5.9
torch-geometric==1.7.2
torch-scatter==2.0.7
torch-sparse==0.6.10
torch-spline-conv==1.2.1
torchaudio==0.9.0
torchvision==0.10.0+cu102
tornado==6.1
tqdm==4.61.1
traitlets==5.0.5
typing-extensions @ file:///opt/conda/conda-bld/typing_extensions_1647553014482/work
urllib3 @ file:///opt/conda/conda-bld/urllib3_1643638302206/work
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
zipp==3.4.1

That's all I had to add, so I'm closing the issue.

Thanks a ton @octavian-ganea @PattanaikL