z-fabian/HUMUS-Net

Segmentation fault (core dumped)

Closed this issue · 10 comments

When I tried to evaluate the model with

!python3 ./humus_examples/eval_humus_fastmri.py \
--checkpoint_file ./model_checkpoint/humusnet_fastmri_knee_8x_trainval.ckpt \
--data_path [my directory]/knee_multicoil_test \
--gpus 3

I got a "Segmentation fault (core dumped)" error, even though I set up a fresh environment as follows:

numpy>=1.18.5
runstats>=1.8.0
h5py==2.10.0
PyYAML>=5.3.1
pyxb==1.2.6
xmltodict
einops==0.3.0
fastmri==0.1.1
timm==0.4.12
torchmetrics==0.7.3

It seems that the problem is related to pytorch-lightning.
Is there a way to run the code without pytorch-lightning?
Or would it be possible for you to build a Docker container and publish it on GitHub?

Thank you in advance

Hi, this might be because you have not installed the CUDA version of torch. Please make sure that you install PyTorch 1.10.1 with CUDA support from here: https://pytorch.org/get-started/previous-versions/

You can check whether you are using the CUDA version by running pip3 freeze
and checking whether your torch version shows up as torch==1.10.1+cu113 (or whichever other CUDA version you have installed).
Let me know if you have further issues.
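The same check can also be done from Python. The following is a minimal sketch, not part of the HUMUS-Net repository: the helper names `split_torch_version` and `is_cuda_build` are made up for illustration. Note that the `+cuXXX` suffix is only a hint, since some wheels carry no suffix at all; `torch.version.cuda` reports the CUDA version that torch was built against and is `None` for CPU-only builds.

```python
import importlib.util


def split_torch_version(version):
    """Split a version such as '1.10.1+cu113' into ('1.10.1', 'cu113').

    Hypothetical helper: the second element is None when there is no '+' suffix.
    """
    base, _, local = version.partition("+")
    return base, local or None


def is_cuda_build(version):
    """Heuristic: True if the suffix after '+' starts with 'cu'."""
    _, local = split_torch_version(version)
    return local is not None and local.startswith("cu")


# Only query torch if it is actually installed in this environment.
if importlib.util.find_spec("torch") is not None:
    import torch

    print("torch version:", torch.__version__)
    print("built for CUDA:", torch.version.cuda)  # None on CPU-only builds
    print("CUDA available:", torch.cuda.is_available())
    print("suffix suggests CUDA build:", is_cuda_build(torch.__version__))
```

If `torch.cuda.is_available()` prints False although a CUDA build is installed, the driver or the visible GPUs are worth checking before anything else.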

I checked my PyTorch and CUDA version.


I found that my environment is torch==1.10.1+cu102 with Python version 3.7.11. But the code is still not working.

Thanks for checking your PyTorch version. It looks like the main difference is the CUDA major version. I recommend upgrading to CUDA 11 and reinstalling PyTorch for that version directly from here. We have tested the code on CUDA 11.2 and 11.4.
I will also create a Docker container and share it, to make sure there are no other version differences.

That is really kind of you! I will try running the code again after resetting my environment. I will look forward to your docker container too. Thank you in advance!

I upgraded my environment to CUDA 11.4 for the driver and CUDA 11.3 for PyTorch (it seems that the official website doesn't offer builds for 11.2 or 11.4), but I still get the message "Segmentation fault (core dumped)". I searched for this error, and many people say that it is related to pytorch-lightning.


I built a dedicated Docker container just to run this code, so I am sure that the environment is almost the same as in the requirement.txt. Since I downloaded the pre-trained model from here and the dataset from here, I don't think the data or the model itself is the problem.

Hi, thank you for your patience with this issue. I am working on the Docker container now; if that doesn't work for you, I will look into updating the code to a newer version of pytorch-lightning. Meanwhile, take a look at the environment below, which I set up following the README instructions and which can run the code (my CUDA version is also 11.4).

absl-py==1.0.0
aiohttp==3.8.1
aiosignal==1.2.0
async-timeout==4.0.2
attrs==21.4.0
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
einops==0.3.0
fastmri==0.1.1
frozenlist==1.3.0
fsspec==2022.3.0
future==0.18.2
google-auth==2.6.6
google-auth-oauthlib==0.4.6
grpcio==1.44.0
h5py==2.10.0
idna==3.3
imageio==2.18.0
importlib-metadata==4.11.3
Markdown==3.3.6
multidict==6.0.2
networkx==2.8
numpy==1.22.3
oauthlib==3.2.0
packaging==21.3
Pillow==9.1.0
protobuf==3.20.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyDeprecate==0.3.0
pyparsing==3.0.8
pytorch-lightning==1.3.3
PyWavelets==1.3.0
PyXB==1.2.6
PyYAML==5.4.1
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.8
runstats==2.0.0
scikit-image==0.19.2
scipy==1.8.0
six==1.16.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tifffile==2022.4.26
timm==0.4.12
torch==1.10.1+cu111
torchaudio==0.10.1+rocm4.1
torchmetrics==0.7.3
torchvision==0.11.2+cu111
tqdm==4.64.0
typing-extensions==4.2.0
urllib3==1.26.9
Werkzeug==2.1.1
xmltodict==0.12.0
yarl==1.7.2
zipp==3.8.0

Also, can you make sure that you installed PyTorch in a clean environment, using the command from the official website link I shared:
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

It looks like the segmentation fault is caused by a version mismatch between different torch libraries, as mentioned for example here, so I want to make sure that is not the case for you.
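One way to spot such a mismatch is to compare the build suffixes (the part after `+`) of the torch packages in the `pip3 freeze` output. The sketch below is illustrative only: the function names and the sample freeze text are hypothetical, and a differing suffix is a reason to look closer, not proof that it causes a crash.

```python
def local_tags(freeze_text, packages=("torch", "torchvision", "torchaudio")):
    """Map each listed package found in `pip freeze` output to its '+suffix'.

    Packages pinned without a suffix map to None. Hypothetical helper.
    """
    tags = {}
    for line in freeze_text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep and name.lower() in packages:
            _, _, local = version.partition("+")
            tags[name.lower()] = local or None
    return tags


def differing_from_torch(tags):
    """Return the packages whose suffix differs from the one torch uses."""
    reference = tags.get("torch")
    return sorted(name for name, tag in tags.items()
                  if name != "torch" and tag != reference)


# Hypothetical freeze output, made up for this example.
sample = """\
torch==1.10.1+cu111
torchvision==0.11.2+cu102
numpy==1.22.3
"""

tags = local_tags(sample)
print(tags)
print(differing_from_torch(tags))
```

To check a real environment, the text passed to `local_tags` would be the output of `pip3 freeze` instead of the sample string.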

I created a docker image with all dependencies pre-installed:
docker pull zalanfabian/humus-net
I ran an interactive container based on this image, changed into the HUMUS-Net directory, and was able to run both the training and the evaluation code. Let me know how it works out for you.

I had installed PyTorch from the official website, but with this command:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

So I reinstalled it using the command that you posted in the comments:

pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html

Then I got this message:
image

However, the code seems to be working, with a few warnings.
[three screenshots of the warnings]

I posted all the warnings in case you need them when you upgrade your PyTorch version. Your Docker image worked well too! Thank you for your help!

I'm happy to see that it eventually worked for you. I am aware of these warnings, but they shouldn't impact model performance in any way. The error while installing package dependencies also makes sense: the fastmri module is not officially compatible with the newer pytorch-lightning version we are using, but this shouldn't be an issue either.

@njmsjmdchtz Hello, I ran into a similar problem. How did you solve it? Thank you!

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastmri 0.1.1 requires pytorch-lightning<1.1,>=1.0.6, but you have pytorch-lightning 1.3.3 which is incompatible.
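To see why pip reports this conflict, the requirement `pytorch-lightning<1.1,>=1.0.6` can be checked by hand. The snippet below is a simplified sketch that only handles plain dotted numeric versions; pip itself follows the full PEP 440 rules, and the helper names here are hypothetical.

```python
def as_tuple(version):
    """Turn a plain dotted version such as '1.3.3' into (1, 3, 3)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Check lower_inclusive <= version < upper_exclusive (simplified)."""
    v = as_tuple(version)
    return as_tuple(lower_inclusive) <= v < as_tuple(upper_exclusive)


# fastmri 0.1.1 requires pytorch-lightning>=1.0.6,<1.1
print(satisfies("1.3.3", "1.0.6", "1.1"))  # False: 1.3.3 is not below 1.1
print(satisfies("1.0.8", "1.0.6", "1.1"))  # True: an example inside the range
```

The installed version 1.3.3 falls outside the range that fastmri 0.1.1 declares, which is what the pip message states.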