amiralansary/rl-medical

multi-agent landmark detection

Closed this issue · 11 comments

When I run the multi-agent landmark detection project, I always get a segmentation fault. How can I fix it?

gml16 commented

Would you mind sharing the command you are using and any traceback you receive, so I can diagnose the problem?
The multi-agent project has been exported to github.com/gml16/rl-medical, could you try running this one instead as well? Sorry for the confusion, I will add it to the readme.

Thanks for your answer, but I have run the project from github.com/gml16/rl-medical and it still doesn't work.

[screenshot: terminal output ending in 段错误(核心已转储)]
P.S. 段错误(核心已转储) means "Segmentation fault (core dumped)".
Sorry, my Linux locale is set to Chinese.

gml16 commented

Oh I see, not much information there. I don't remember ever having encountered this issue. Would you mind sharing your py36torch environment so I could try to reproduce it? If it comes from the environment, I will add an environment file with the exact Python and package versions I am using. PS: no need to apologise for the language setting :)

EDIT: does the evaluation command work for you, or does it also return a segmentation fault?

[screenshot: package list of the py36torch conda environment]
This is my environment. Unfortunately, the evaluation command also returns a segmentation fault 👀 👀

gml16 commented

Thanks for sharing your environment. I created a conda environment with only the necessary modules, using the same versions as you. I didn't get any segmentation fault, but I did get an error from NumPy (updating to a newer version solved it), and OpenCV is required to run the code but doesn't appear to be in your environment.
I've exported my minimal conda environment using Python 3.6 (which isn't so small after all). Would you mind saving the text below as env.yml, running conda env create -f env.yml, then conda activate rl-medical-36, and trying again?
Hopefully this solves this cryptic seg fault.

Edit: if you prefer, I've added on the gml16/rl-medical repo an environment.yml file using Python 3.8 that I have tested on two machines.

name: rl-medical-36
channels:
  - pytorch
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - blas=1.0=mkl
  - ca-certificates=2020.7.22=0
  - certifi=2020.6.20=py36_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - intel-openmp=2020.2=254
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - mkl=2020.2=256
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.2.0=py36h23d657b_0
  - mkl_random=1.1.1=py36h0573a6f_0
  - ncurses=6.2=he6710b0_1
  - ninja=1.10.1=py36hfd86e86_0
  - openssl=1.1.1h=h7b6447c_0
  - pip=20.2.3=py36_0
  - python=3.6.12=hcff3b4d_2
  - pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0
  - readline=8.0=h7b6447c_0
  - setuptools=49.6.0=py36_1
  - six=1.15.0=py_0
  - sqlite=3.33.0=h62c20be_0
  - tk=8.6.10=hbc83047_0
  - wheel=0.35.1=py_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - absl-py==0.10.0
    - aiohttp==3.6.2
    - async-timeout==3.0.1
    - attrs==20.2.0
    - cachetools==4.1.1
    - chardet==3.0.4
    - cloudpickle==1.6.0
    - cycler==0.10.0
    - future==0.18.2
    - google-auth==1.22.0
    - google-auth-oauthlib==0.4.1
    - grpcio==1.32.0
    - gym==0.17.3
    - idna==2.10
    - idna-ssl==1.1.0
    - importlib-metadata==2.0.0
    - markdown==3.2.2
    - matplotlib==2.0.2
    - msgpack==1.0.0
    - msgpack-numpy==0.4.7.1
    - multidict==4.7.6
    - numpy==1.19.2
    - oauthlib==3.1.0
    - olefile==0.46
    - opencv-python==4.4.0.44
    - pillow==4.2.1
    - protobuf==3.13.0
    - psutil==5.7.2
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pyglet==1.5.0
    - pyparsing==2.4.7
    - python-dateutil==2.8.1
    - pytz==2020.1
    - pyzmq==19.0.2
    - requests==2.24.0
    - requests-oauthlib==1.3.0
    - rsa==4.6
    - scipy==1.5.2
    - simpleitk==1.2.4
    - tabulate==0.8.7
    - tensorboard==2.3.0
    - tensorboard-plugin-wit==1.7.0
    - tensorpack==0.9.5
    - termcolor==1.1.0
    - tqdm==4.50.0
    - typing-extensions==3.7.4.3
    - urllib3==1.25.10
    - werkzeug==1.0.1
    - yarl==1.6.0
    - zipp==3.3.0
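
In case it helps, these are just the setup commands from the comment above, run from the directory where env.yml was saved (nothing here beyond what was already mentioned):

conda env create -f env.yml
conda activate rl-medical-36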

Thanks for your help. However, when I run the project with your environment.yml file, I get a new error:
[screenshot: CUDA out-of-memory error]
The error says it needs 42.4 GB of GPU memory?

gml16 commented

Great, I think it should be working then. This memory error comes from the default replay buffer being quite large; try setting the flag --memory_size to a lower value such as 1000 and, similarly, set --init_memory_size to 500.
Do you know if the evaluation works now?

Edit: the default training values are the ones I used to produce the models presented in the paper; I will add a note to the README about reducing the memory size if it is an issue.
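
A sketch of what a training invocation might look like with the smaller buffer; the script name DQN.py and the --task/--files arguments are placeholders based on the repo's usual layout and may differ from your exact command, only --memory_size and --init_memory_size come from this thread:

python DQN.py --task train --files image_files.txt landmark_files.txt --memory_size 1000 --init_memory_size 500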

Yes, I set a lower value and it works! I tried the evaluation command and received this error: pyglet.canvas.xlib.NoSuchDisplayException: Cannot connect to "None"
That should be a problem on my side; I think the error comes from the render function. My server doesn't have a virtual display installed yet. How do you get training images from the server?

gml16 commented

Good news :) Yes, you get this error because the code cannot render; you can use the flag --viz 0 to disable rendering.
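
For example, appended to a typical evaluation command (the script name, --task, --load and --files values below are placeholders, not taken from this thread; only --viz 0 is):

python DQN.py --task eval --load path/to/model.pt --files image_files.txt landmark_files.txt --viz 0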

Thank you for your patient guidance, your idea inspired me. 😁😁

gml16 commented

Glad I could help :)
You can also use the flag --saveGif to save the evaluation as a gif. That can be useful on headless machines.
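
For example, with the same placeholder evaluation command as above (only --saveGif comes from this comment; whether it also needs rendering disabled on a headless machine isn't confirmed here):

python DQN.py --task eval --load path/to/model.pt --files image_files.txt landmark_files.txt --saveGif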