OS Error with torch-sparse
finalelement opened this issue · 8 comments
I was trying to run this repository with the QM9 dataset. First I ran into the problem reported in issues #2 and #4.
Based on that, I tried downgrading torch to 1.7.0 and torch-geometric to both 1.6.3 and 1.7.2. However, I was unable to get past the error below. I looked for other solutions to this error but could not find many resources apart from this one here.
Perhaps if a requirement file could be shared from the owner of this repository, I would be able to create an environment where this code can run.
Let me know if more info is needed from my side.
:~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Traceback (most recent call last):
File "train.py", line 9, in <module>
from model.model import GeoMol
File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
import torch_geometric as tg
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
import torch_geometric.data
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
from .data import Data
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
from torch_sparse import coalesce, SparseTensor
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 19, in <module>
torch.ops.load_library(spec.origin)
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_ops.py", line 105, in load_library
ctypes.CDLL(path)
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/ctypes/__init__.py", line 364, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/_version_cpu.so: undefined symbol: _ZN3c106detail12infer_schema20make_function_schemaENS_8ArrayRefINS1_11ArgumentDefEEES4_
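For anyone hitting the same `undefined symbol` error: an unresolved symbol from the `c10` namespace while loading `_version_cpu.so` typically means the torch-sparse binary was compiled against a different torch build than the one installed. Below is a minimal sketch for listing what is actually installed, assuming Python 3.8+ (for `importlib.metadata`). `installed_versions` is a hypothetical helper, not part of any of these packages.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    packages = [
        "torch",
        "torch-geometric",
        "torch-scatter",
        "torch-sparse",
        "torch-cluster",
        "torch-spline-conv",
    ]
    for name, version in installed_versions(packages).items():
        print(f"{name}: {version or 'not installed'}")
```

The compiled companions (torch-scatter, torch-sparse, torch-cluster, torch-spline-conv) generally need to come from wheels built for the exact torch and CUDA combination in the environment, so comparing this output against a known-working list is a reasonable first check.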
Hi,
Did you try using the packages versions shown in https://github.com/PattanaikL/GeoMol/blob/main/count_geomol_failures.ipynb ?
No luck with that either, my friend. I am receiving a new error:
What CUDA version is being used at your end? I've been trying with 10.1
/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Traceback (most recent call last):
File "train.py", line 9, in <module>
from model.model import GeoMol
File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
import torch_geometric as tg
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 2, in <module>
import torch_geometric.nn
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/nn/__init__.py", line 2, in <module>
from .data_parallel import DataParallel
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/nn/data_parallel.py", line 5, in <module>
from torch_geometric.data import Batch
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
from .data import Data
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
from torch_sparse import coalesce, SparseTensor
File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 15, in <module>
f'{library}_{suffix}', [osp.dirname(__file__)]).origin)
AttributeError: 'NoneType' object has no attribute 'origin'
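The `'NoneType' object has no attribute 'origin'` failure comes from the line shown in the traceback: the lookup for `{library}_{suffix}` in the torch_sparse package directory found no matching file, and `.origin` was then read from `None`. The sketch below reproduces that behaviour with the standard library only. It assumes that the lookup goes through `importlib.machinery.PathFinder` and that the suffix is `cpu` or `cuda`; both are inferred from the tracebacks, and `find_extension` is a hypothetical helper.

```python
import importlib.machinery
import tempfile


def find_extension(name, directory):
    """Return the file path of module `name` inside `directory`, or None."""
    spec = importlib.machinery.PathFinder.find_spec(name, [directory])
    # find_spec returns None when nothing matches, so guard before
    # touching .origin instead of failing with AttributeError.
    return None if spec is None else spec.origin


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as empty_dir:
        # No compiled library exists here, so the lookup finds nothing.
        print(find_extension("_version_cuda", empty_dir))
```

If this is what is happening, the installed torch-sparse wheel probably does not contain the variant (CPU or CUDA) that the installed torch expects, which again points at a wheel that does not match the torch/CUDA combination.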
Here is also my list of package versions:
~/Code/geo_mol/GeoMol$ pip list
Package Version
----------------------------- -------------------
argon2-cffi 20.1.0
ase 3.22.1
async-generator 1.10
attrs 21.2.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
bleach 3.3.0
cached-property 1.5.2
certifi 2021.10.8
cffi 1.14.5
charset-normalizer 2.0.12
cycler 0.10.0
dataclasses 0.6
decorator 4.4.2
defusedxml 0.7.1
entrypoints 0.3
future 0.18.2
googledrivedownloader 0.4
h5py 3.6.0
idna 3.3
importlib-metadata 4.6.0
ipykernel 5.5.5
ipython 7.25.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
isodate 0.6.1
jedi 0.18.0
Jinja2 3.0.1
joblib 1.1.0
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
kiwisolver 1.3.1
llvmlite 0.38.0
MarkupSafe 2.0.1
matplotlib 3.3.4
matplotlib-inline 0.1.2
mistune 0.8.4
mkl-fft 1.3.0
mkl-random 1.2.1
mkl-service 2.3.0
nbclient 0.5.3
nbconvert 6.1.0
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 2.5.1
notebook 6.4.0
numba 0.55.1
numpy 1.20.2
olefile 0.46
packaging 20.9
pandas 1.2.5
pandocfilters 1.4.2
parso 0.8.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.1.3
POT 0.7.0
prometheus-client 0.11.0
prompt-toolkit 3.0.19
ptyprocess 0.7.0
py3Dmol 0.9.1
pycparser 2.20
Pygments 2.9.0
pyparsing 2.4.7
pyrsistent 0.17.3
python-dateutil 2.8.1
python-louvain 0.16
pytz 2021.1
PyYAML 5.3.1
pyzmq 22.1.0
qtconsole 5.1.1
QtPy 1.9.0
rdflib 6.1.1
requests 2.27.1
scikit-learn 1.0.2
scipy 1.6.2
seaborn 0.11.1
Send2Trash 1.7.1
setuptools 52.0.0.post20210125
six 1.16.0
terminado 0.10.1
testpath 0.5.0
threadpoolctl 3.1.0
torch 1.7.0
torch-cluster 1.5.9
torch-geometric 1.6.3
torch-scatter 2.0.9
torch-sparse 0.6.9
torch-spline-conv 1.2.1
torchaudio 0.7.0a0+ac17b64
torchvision 0.8.0
tornado 6.1
tqdm 4.61.1
traitlets 5.0.5
typing-extensions 3.10.0.0
urllib3 1.26.9
wcwidth 0.2.5
webencodings 0.5.1
wheel 0.36.2
widgetsnbextension 3.5.1
zipp 3.4.1
Hey, apologies for all the issues you're having. Installing the correct version of torch-geometric and its dependencies has been really difficult. For my environment, I'm using CUDA 10.2. I've generated a requirements.txt file from my environment and attached it here. Also, here's what I get from pip list. Please let us know if any of these options are helpful.
Package Version
----------------------------- -------------------
alembic 1.6.5
argon2-cffi 20.1.0
ase 3.16.2
async-generator 1.10
attrs 21.2.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
bleach 3.3.0
certifi 2021.5.30
cffi 1.14.5
chardet 4.0.0
click 8.0.1
cliff 2.15.0
cmaes 0.8.2
cmd2 0.9.22
colorama 0.4.4
colorlog 5.0.1
CoolProp 6.4.1
cycler 0.10.0
Cython 0.29.24
decorator 4.4.2
defusedxml 0.7.1
entrypoints 0.3
Flask 1.1.2
googledrivedownloader 0.4
greenlet 1.1.0
idna 2.10
importlib-metadata 4.6.0
ipykernel 5.5.5
ipython 7.25.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
isodate 0.6.0
itsdangerous 2.0.1
jedi 0.18.0
Jinja2 3.0.1
joblib 1.0.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
kiwisolver 1.3.1
Mako 1.1.4
MarkupSafe 2.0.1
matplotlib 3.3.4
matplotlib-inline 0.1.2
mistune 0.8.4
mkl-fft 1.3.0
mkl-random 1.2.1
mkl-service 2.3.0
mpmath 1.2.1
nbclient 0.5.3
nbconvert 6.1.0
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 2.5.1
notebook 6.4.0
numpy 1.20.2
olefile 0.46
optuna 2.8.0
packaging 20.9
pandas 1.2.5
pandocfilters 1.4.2
parso 0.8.2
pbr 5.6.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.1.3
POT 0.7.0
prettytable 0.7.2
prometheus-client 0.11.0
prompt-toolkit 3.0.19
ptyprocess 0.7.0
py-rdl 0.0.0
py3Dmol 0.9.1
pycparser 2.20
PyDAS 1.0.2
PyDQED 1.0.1
Pygments 2.9.0
pyparsing 2.4.7
pyperclip 1.8.2
pyrsistent 0.17.3
python-dateutil 2.8.1
python-editor 1.0.4
python-louvain 0.15
pytz 2021.1
PyYAML 5.3.1
pyzmq 22.1.0
qtconsole 5.1.1
QtPy 1.9.0
quantities 0.12.5
rdflib 5.0.0
requests 2.25.1
scikit-learn 0.24.2
scipy 1.6.2
seaborn 0.11.1
Send2Trash 1.7.1
setuptools 52.0.0.post20210125
six 1.16.0
SQLAlchemy 1.4.22
stevedore 2.0.1
sympy 1.8
terminado 0.10.1
testpath 0.5.0
threadpoolctl 2.1.0
torch 1.9.0
torch-cluster 1.5.9
torch-geometric 1.7.2
torch-scatter 2.0.7
torch-sparse 0.6.10
torch-spline-conv 1.2.1
torchvision 0.10.0
tornado 6.1
tqdm 4.61.1
traitlets 5.0.5
typing-extensions 3.10.0.0
urllib3 1.26.6
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 1.0.1
wheel 0.36.2
widgetsnbextension 3.5.1
zipp 3.4.1
@PattanaikL Thanks a ton for sharing the requirements.txt. I'll get back shortly with an update on how it goes with the shared requirements.txt. Also, thanks for sharing the CUDA toolkit version. :)
@PattanaikL I think the requirements.txt was useful. It seems I am getting closer to being able to run this. This is the latest error I am getting; do let me know if you have seen it before.
(geomol_v2) ~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Arguments are...
log_dir: ./test_run
data_dir: /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9
split_path: /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy
trained_local_model: None
restart_dir: None
dataset: qm9
seed: 0
n_epochs: 250
warmup_epochs: 2
batch_size: 16
lr: 0.001
num_workers: 2
optimizer: adam
scheduler: plateau
verbose: False
model_dim: 25
random_vec_dim: 10
random_vec_std: 1
random_alpha: False
n_true_confs: 10
n_model_confs: 10
gnn1_depth: 3
gnn1_n_layers: 2
gnn2_depth: 3
gnn2_n_layers: 2
encoder_n_head: 2
coord_pred_n_layers: 2
d_mlp_n_layers: 1
h_mol_mlp_n_layers: 1
alpha_mlp_n_layers: 2
c_mlp_n_layers: 1
global_transformer: False
loss_type: ot_emd
teacher_force: False
separate_opts: False
Model parameters are:
hyperparams:
model_dim: 25
random_vec_dim: 10
random_vec_std: 1
global_transformer: False
n_true_confs: 10
n_model_confs: 10
gnn1:
depth: 3
n_layers: 2
gnn2:
depth: 3
n_layers: 2
encoder:
n_head: 2
coord_pred:
n_layers: 2
d_mlp:
n_layers: 1
h_mol_mlp:
n_layers: 1
alpha_mlp:
n_layers: 2
c_mlp:
n_layers: 1
loss_type: ot_emd
teacher_force: False
random_alpha: False
num_node_features: 44
num_edge_features: 4
Starting training...
0%| | 0/625 [00:00<?, ?it/s]
[13:51:21] Explicit valence for atom # 0 N, 4, is greater than permitted
Traceback (most recent call last):
File "train.py", line 69, in <module>
train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
File "/home/vishwesh/Code/geo_mol/GeoMol/model/training.py", line 18, in train
for i, data in tqdm(enumerate(loader), total=len(loader)):
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 193, in __getitem__
data = self.get(self.indices()[idx])
File "/home/vishwesh/Code/geo_mol/GeoMol/model/featurization.py", line 74, in get
data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
File "/home/vishwesh/Code/geo_mol/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
keep = [t.to(device) for t in keep]
File "/home/vishwesh/Code/geo_mol/GeoMol/model/utils.py", line 122, in <listcomp>
keep = [t.to(device) for t in keep]
File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Could you try uncommenting this line of train.py?
@PattanaikL Thank you for the suggestion. I had to make some more changes along with the proposed one to get it to run. I'll post a detailed summary tomorrow morning, so please don't close the issue yet. I think the shared insight will be helpful for the community as well.
Thanks a ton for helping me out in getting this to run. :)
To resolve the 'Cannot re-initialize CUDA in forked subprocess' error, you must use the 'spawn' start method. This required wrapping the entire training code in train.py in a main() function and calling it with:
if __name__ == "__main__":
    main()
This post serves as a resolution as well.
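A minimal sketch of the pattern described above, using the standard-library `multiprocessing` module so it runs without a GPU. With PyTorch, the equivalent step would be selecting the 'spawn' start method (for example via `torch.multiprocessing.set_start_method('spawn')`) before the DataLoader workers start. The body of `main()` here is a stand-in, not the actual GeoMol training loop.

```python
import multiprocessing as mp


def main():
    # A "spawn" context starts every worker in a fresh interpreter instead of
    # forking the parent, so workers do not inherit already-initialized state
    # (in the real script, that state would be the CUDA context).
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        # Stand-in workload; the real script would iterate over a DataLoader.
        results = pool.map(abs, [-1, -2, -3])
    print(results)


if __name__ == "__main__":
    # Spawned workers re-import this module. Without the guard, each worker
    # would run main() again and try to start workers of its own.
    main()
```

The guard matters specifically because of 'spawn': with 'fork' the children never re-import the script, which is why the original top-level training code only broke once the start method changed.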
Here is also a list of the latest requirements.txt from my side as I had to resolve some dependencies manually (This is for a conda environment with python version 3.8 and CUDA 10.2):
absl-py==1.0.0
alembic==1.6.5
argon2-cffi==20.1.0
ase==3.16.2
async-generator==1.10
attrs==21.2.0
backcall==0.2.0
backports.functools-lru-cache==1.6.4
bleach==3.3.0
brotlipy==0.7.0
cachetools==5.0.0
certifi==2021.5.30
cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work
chardet==4.0.0
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.0.1
cliff==2.15.0
cmaes==0.8.2
cmd2==0.9.22
colorama==0.4.4
colorlog==5.0.1
CoolProp==6.4.1
cryptography @ file:///tmp/build/80754af9/cryptography_1639400846433/work
cycler==0.10.0
Cython==0.29.24
decorator==4.4.2
defusedxml==0.7.1
dpcpp-cpp-rt==2022.0.2
entrypoints==0.3
Flask==1.1.2
google-auth==2.6.5
google-auth-oauthlib==0.4.6
googledrivedownloader==0.4
greenlet==1.1.0
grpcio==1.44.0
idna @ file:///tmp/build/80754af9/idna_1637925883363/work
importlib-metadata==4.6.0
intel-cmplr-lib-rt==2022.0.2
intel-cmplr-lic-rt==2022.0.2
intel-opencl-rt==2022.0.2
intel-openmp==2022.0.2
ipykernel==5.5.5
ipython==7.25.0
ipython-genutils==0.2.0
ipywidgets==7.6.3
isodate==0.6.0
itsdangerous==2.0.1
jedi==0.18.0
Jinja2==3.0.1
joblib==1.0.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.12
jupyter-console==6.4.0
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
kiwisolver==1.3.1
Mako==1.1.4
Markdown==3.3.6
MarkupSafe==2.0.1
matplotlib==3.3.4
matplotlib-inline==0.1.2
mistune==0.8.4
mkl==2022.0.2
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186064646/work
mkl-service==2.4.0
mpmath==1.2.1
nbclient==0.5.3
nbconvert==6.1.0
nbformat==5.1.3
nest-asyncio==1.5.1
networkx==2.5.1
notebook==6.4.0
numpy==1.21.6
oauthlib==3.2.0
olefile==0.46
optuna==2.8.0
packaging==20.9
pandas==1.2.5
pandocfilters==1.4.2
parso==0.8.2
pbr==5.6.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.0.1
POT==0.7.0
prettytable==0.7.2
prometheus-client==0.11.0
prompt-toolkit==3.0.19
protobuf==3.20.0
ptyprocess==0.7.0
py3Dmol==0.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments==2.9.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing==2.4.7
pyperclip==1.8.2
pyrsistent==0.17.3
PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work
python-dateutil==2.8.1
python-editor==1.0.4
python-louvain==0.15
pytz==2021.1
PyYAML==5.3.1
pyzmq==22.1.0
qtconsole==5.1.1
QtPy==1.9.0
quantities==0.12.5
rdflib==5.0.0
rdkit-pypi==2022.3.1
requests @ file:///opt/conda/conda-bld/requests_1641824580448/work
requests-oauthlib==1.3.1
rsa==4.8
scikit-learn==0.24.2
scipy==1.6.2
seaborn==0.11.1
Send2Trash==1.8.0
six @ file:///tmp/build/80754af9/six_1644875935023/work
SQLAlchemy==1.4.22
stevedore==2.0.1
sympy==1.8
tbb==2021.5.1
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
terminado==0.10.1
testpath==0.5.0
threadpoolctl==2.1.0
torch==1.9.0+cu102
torch-cluster==1.5.9
torch-geometric==1.7.2
torch-scatter==2.0.7
torch-sparse==0.6.10
torch-spline-conv==1.2.1
torchaudio==0.9.0
torchvision==0.10.0+cu102
tornado==6.1
tqdm==4.61.1
traitlets==5.0.5
typing-extensions @ file:///opt/conda/conda-bld/typing_extensions_1647553014482/work
urllib3 @ file:///opt/conda/conda-bld/urllib3_1643638302206/work
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
zipp==3.4.1
That's all I had to add, I'm closing the issue.
Thanks a ton @octavian-ganea @PattanaikL