Doodleverse/segmentation_gym

integrate transformers library in conda env for `segformer` model option

dbuscombe-usgs opened this issue · 18 comments

The segformer model is now fully integrated, but there remain some issues with the conda environment

In #115 @CameronBodine noted

... it threw an error (see below). I again had issues with not finding the libcuda library, similar to what I noted on #78, so I went through the process of re-installing CUDA and NVIDIA on my device (see https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html).

I didn't get the error on my Ubuntu box but did on my Windows box. I'm looking for a conda env workaround

I believe this is my recipe for install on Windows:

  1. Made sure NVIDIA drivers were up to date.
  2. Install and set libmamba as the default environment solver in base environment following this.
  3. Install everything else except tensorflow as recommended in #78, except I did python=3.9 instead of 3.8. It may work with 3.10 but I have not tested:
conda create -n gym python=3.9
conda activate gym
conda install -c conda-forge scipy "numpy>=1.16.5, <=1.23.0" scikit-image cython ipython joblib tqdm pandas pip plotly natsort pydensecrf matplotlib 
pip install doodleverse_utils transformers
  4. Then I followed the tensorflow install instructions for installing on Windows Native OS:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
# Anything above 2.10 is not supported on the GPU on Windows Native
python -m pip install "tensorflow<2.11"
# Verify install:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
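The `tensorflow<2.11` pin above can also be checked from Python before training. This is a minimal sketch that assumes only the constraint stated in the recipe (GPU support on Windows Native stops after 2.10). The helper name `gpu_ok_on_windows_native` is hypothetical and is not part of TensorFlow.

```python
import re


def gpu_ok_on_windows_native(tf_version: str) -> bool:
    """Return True if a TensorFlow version string is below 2.11.

    Hypothetical helper: it only encodes the "tensorflow<2.11" pin from the
    recipe above. It does not query TensorFlow or the GPU.
    """
    match = re.match(r"(\d+)\.(\d+)", tf_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {tf_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles 2.9 < 2.10 < 2.11 correctly, unlike string comparison.
    return (major, minor) < (2, 11)


if __name__ == "__main__":
    for version in ("2.10.1", "2.11.0"):
        print(version, gpu_ok_on_windows_native(version))
```

Inside the gym environment, the string to pass would be `tf.__version__` after `import tensorflow as tf`.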

Hope this works for you!

Thanks, I will give this a try!

This reminds me I need to nuke the 'pydensecrf' requirement from the docs

I successfully installed the conda env, but it doesn't work. I get the same error

2023-02-24 11:51:04.512581: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2
  /usr/local/cuda
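The warning says XLA looked for an `nvvm/libdevice` subdirectory under each listed CUDA directory. A rough sketch for checking which candidate directories actually contain it is given here. The function name `dirs_with_libdevice` is hypothetical, and adding `CONDA_PREFIX` to the candidates is an assumption (the log does not say that location is searched).

```python
import os
from pathlib import Path


def dirs_with_libdevice(candidates):
    """Return the candidate CUDA directories that contain nvvm/libdevice."""
    found = []
    for candidate in candidates:
        if not candidate:
            continue  # skip unset entries, e.g. CONDA_PREFIX outside a conda env
        if (Path(candidate) / "nvvm" / "libdevice").is_dir():
            found.append(str(candidate))
    return found


if __name__ == "__main__":
    candidates = [
        # The three directories listed in the warning above.
        "./cuda_sdk_lib",
        "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2",
        "/usr/local/cuda",
        # Assumption: the active conda env might also hold a CUDA install.
        os.environ.get("CONDA_PREFIX"),
    ]
    print(dirs_with_libdevice(candidates))
```

An empty list would be consistent with the warning: none of the searched locations holds libdevice.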

I didn't first make sure NVIDIA drivers were up to date. I don't know how to do this, and don't remember ever having to do this before

FYI:

(gym) PS E:\Python\segmentation_gym> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:12:52_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.1, V10.1.243
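The `nvcc -V` output above reports release 10.1, while the recipe installs `cudatoolkit=11.2` and the warning mentions a `v11.2` directory. A small sketch for pulling the release number out of that text, so the two can be compared, might look like this. The function name `parse_nvcc_release` is hypothetical, and the sample text is copied from the output above.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output.

    Returns None when no release number is present.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:12:52_Pacific_Daylight_Time_2019
Cuda compilation tools, release 10.1, V10.1.243"""


if __name__ == "__main__":
    release = parse_nvcc_release(SAMPLE)
    print("nvcc release:", release)
    print("matches cudatoolkit 11.2:", release == (11, 2))
```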

Ugh, windows

conda install -n base conda-libmamba-solver fails too

Sucky. Maybe fresh miniconda install?

Hmmm. I shouldn't need to update my drivers, or reinstall conda. That would be too disruptive for me. I'm going to see if I can figure out a conda solution

FYI:

(gym) PS E:\Python\segmentation_gym> conda info

     active environment : gym
    active env location : C:\Users\csb67\AppData\Local\miniconda3\envs\gym
            shell level : 2
       user config file : C:\Users\csb67\.condarc
 populated config files : C:\Users\csb67\.condarc
          conda version : 23.1.0
    conda-build version : not installed
         python version : 3.10.9.final.0
       virtual packages : __archspec=1=x86_64
                          __cuda=12.0=0
                          __win=0=0
       base environment : C:\Users\csb67\AppData\Local\miniconda3  (writable)
      conda av data dir : C:\Users\csb67\AppData\Local\miniconda3\etc\conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/win-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/win-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://repo.anaconda.com/pkgs/msys2/win-64
                          https://repo.anaconda.com/pkgs/msys2/noarch
                          https://conda.anaconda.org/conda-forge/win-64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : C:\Users\csb67\AppData\Local\miniconda3\pkgs
                          C:\Users\csb67\.conda\pkgs
                          C:\Users\csb67\AppData\Local\conda\conda\pkgs
       envs directories : C:\Users\csb67\AppData\Local\miniconda3\envs
                          C:\Users\csb67\.conda\envs
                          C:\Users\csb67\AppData\Local\conda\conda\envs
               platform : win-64
             user-agent : conda/23.1.0 requests/2.28.1 CPython/3.10.9 Windows/10 Windows/10.0.19044 solver/libmamba conda-libmamba-solver/22.8.1 libmambapy/1.3.1
          administrator : False
             netrc file : None
           offline mode : False

I do believe the change to python 3.10 was a significant one. I have had to reinstall miniconda on both of my Windows computers recently. For what it's worth!

I think I've now exhausted all options except updating nvidia or conda, which I'm not currently prepared to do. I suppose I will not make segformer models on windows

I was able to install miniconda and use Cam's mamba recipe to install a gym environment. It works with the Unets, but not the segformer model. I'll keep troubleshooting

I did notice in my PINGMapper.yml that I list installing transformers after tensorflow. Perhaps an order of operations thing??

name: ping
channels:
  - conda-forge
  - defaults
dependencies:
  - python
  - pandas
  - rasterio
  - pyproj
  - scikit-image
  - joblib
  - gdal
  - matplotlib
  - pip
  - pip:
      - psutil
      - tensorflow
      - transformers

With miniconda and mamba, I can now get a working environment for training any unet models, in python 3.8, 3.9, and 3.10. I can do this using either the pip or conda way of installing TF, or the conda-forge way.

The only issue is using SegFormer models. It errors out with the same message every time. It doesn't matter if I install transformers using pip or conda, before or after TF

I have not been able to update my NVIDIA drivers. I simply can't find a link that will allow me to install something to "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.2", which is where transformers expects the CUDA drivers to be (what am I missing?)

I thought one major advantage with installing TF the "conda-forge" route was not having to update nvidia drivers on windows.

If I attempt to install the 11.2 cuda toolkit, from here: https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

I get warning messages saying that I'm about to downgrade NVIDIA versions, which doesn't seem right

Neither of my Windows computers, both of which have gym environments working well for Unets, has the nvcc program available. I'm very reluctant to go this route right now, for fear that I will break my working conda envs

Eureka!! Add this to the conda env to make it use segformers

conda install cuda -c nvidia

I will update the README

Ok, I posted new conda recipes for Gym that allow for use of segformers on Windows and Ubuntu

  • This solution does not require updating NVIDIA drivers using executable installers downloaded from nvidia webpages
  • I have a recipe and yml file that appear to work on Windows 10
  • I tested on Windows 10 (two computers) and Ubuntu 22.04 (one computer)
  • I tested using miniconda and anaconda. Miniconda is a lot faster, and uses less hard drive space, so that's what I'm recommending going forward
  • Neither the Windows recipe nor the yml file worked on miniconda on my Ubuntu box, so I came up with a new recipe, which got complicated because it involves operating system commands. Anyway, the new recipes are in the updated README

Thanks @CameronBodine for helping troubleshoot and test!

https://github.com/Doodleverse/segmentation_gym#%EF%B8%8F-installation