kevinjohncutler/omnipose

GPU support for Mac M1?

alix-pham opened this issue · 22 comments

Hi :)

I know the README says GPU acceleration is only available with NVIDIA GPUs (Linux and Windows); I was wondering whether you are planning to provide support for Mac M1 GPUs in the future?

I found this article; would it work? (I guess not, since you use CUDA and CUDA cannot run on the M1?)

Thank you very much in advance!

Best,
Alix

@alix-pham I will definitely get Apple Silicon GPU support working. It should be as simple as changing the device type from cuda to mps in several places, probably not more than a few hours of work once I get the chance. In my testing so far, the much harder part is getting a conda environment installed with all the dependencies built for arm64... so releasing an environment is also on my to-do list.
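
For anyone curious what that change looks like, here is a minimal sketch of PyTorch device selection that prefers CUDA, then Apple's MPS backend, then the CPU (illustrative only, not the actual cellpose/omnipose code):

import torch
from torch import nn

def pick_device(use_gpu=True):
    # Prefer CUDA (NVIDIA), then Apple's MPS backend, then fall back to the CPU.
    if use_gpu and torch.cuda.is_available():
        return torch.device('cuda')
    if use_gpu and hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')

device = pick_device()
net = nn.Linear(8, 2).to(device)        # the model is moved with .to(device)...
x = torch.randn(4, 8, device=device)    # ...and inputs must live on the same device
print(device, net(x).shape)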

Thank you very much @kevinjohncutler! Looking forward to it.

@alix-pham I got it working on my M2 MacBook Air. About a 2-4x speed improvement on the built-in test images in the GUI. Note that this is for evaluation only; I have not tested training yet. Most of the changes are in the cellpose backend, but pulling the omnipose changes and running pip install -e . should fetch the up-to-date cellpose changes as well. I still need to post an environment file for Apple Silicon installs, but if you are able to help test, that would be great.

Thank you so much @kevinjohncutler! I would say I am not familiar enough with these things to help you test it, unfortunately... I am not even sure I understand what I should do to try GPU support on my Mac with the information you provided... 😬
I updated omnipose using pip install git+https://github.com/kevinjohncutler/omnipose.git, but when running the segmentation I get this output (the same as usual):

2022-10-20 11:04:18,530 [INFO] TORCH GPU version not installed/working.
>>> GPU activated? 0
2022-10-20 11:04:18,531 [INFO] >>bact_phase_omni<< model set to be used
2022-10-20 11:04:18,532 [INFO] >>>> using CPU

Is there something else I should do?
Where should I run pip install -e .? I did not clone the repo onto my computer, and since I am not sure what I'm supposed to do with it, I haven't run it yet.

Thanks in advance!

PS: Congrats on your Nature Methods paper 🥳

Thanks @alix-pham! I see, I figured that since you were requesting Mac GPU support you knew what a world of pain you were getting yourself into haha. I may finally have time this weekend to put together a conda environment and installation instructions to make it relatively painless. It's possible that the macOS GUI executable will also 'just work', but I need to compile a new version. The issue you are running into is just down to the dependencies, and the conda environment will aim to solve that.
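
In the meantime, a quick sanity check (plain PyTorch, nothing Omnipose-specific; these attributes exist in torch 1.12+) for whether the installed torch build can see the Apple GPU at all:

import torch

print(torch.__version__)
print(torch.backends.mps.is_built())       # True means this torch build was compiled with MPS support
print(torch.backends.mps.is_available())   # True means the Apple GPU backend can actually be used

If both print False, the "TORCH GPU version not installed/working" message above is expected until the environment is sorted out.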

No, sorry! I just want the processing to be faster, as we are working with big movies, and the GPU should help with that.
Though I am not using the GUI because I'm complementing the segmentation with a tracking pipeline; I figured it would be easier that way.
Thank you very much!

Hi @kevinjohncutler,

Thanks for all your work! I can confirm GPU support for Mac M1/M2 with your modifications, but unfortunately I'm getting an AttributeError while trying to train a new model.

I tested on an M1 Mac:
Python 3.10.4

and on Windows 11:
Python = 3.8.4
pytorch = 1.11.0
cudatoolkit = 11.3.1

Here is the report:

> python -m omnipose --train --use_gpu --dir ./Documents/omni --mask_filter _masks --n_epochs 100 --pretrained_model None --learning_rate 0.1 --diameter 0 --batch_size 16 --RAdam
!NEW LOGGING SETUP! To see cellpose progress, set --verbose
No --verbose => no progress or info printed
2022-11-16 16:02:24,512 [INFO] ** TORCH GPU version installed and working. **
2022-11-16 16:02:24,512 [INFO] >>>> using GPU
Omnipose enabled. See Omnipose repo for licencing details.
2022-11-16 16:02:24,512 [INFO] Training omni model. Setting nclasses=4, RAdam=True
2022-11-16 16:02:24,514 [INFO] not all flows are present, will run flow generation for all images
2022-11-16 16:02:24,515 [INFO] training from scratch
2022-11-16 16:02:24,515 [INFO] median diameter set to 0 => no rescaling during training
2022-11-16 16:02:24,601 [INFO] No precomuting flows with Omnipose. Computed during training.
2022-11-16 16:02:24,608 [INFO] >>> Using RAdam optimizer
2022-11-16 16:02:24,608 [INFO] >>>> training network with 2 channel input <<<<
2022-11-16 16:02:24,608 [INFO] >>>> LR: 0.10000, batch_size: 16, weight_decay: 0.00001
2022-11-16 16:02:24,608 [INFO] >>>> ntrain = 2
2022-11-16 16:02:24,608 [INFO] >>>> nimg_per_epoch = 2
/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py:1105: UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1668586478573/work/aten/src/ATen/mps/MPSFallback.mm:11.)
  denom = torch.multiply(torch.linalg.norm(x,dim=1),torch.linalg.norm(y,dim=1))+eps
2022-11-16 16:02:33,471 [INFO] Epoch 0, Time  8.9s, Loss 4.7680, LR 0.1000
2022-11-16 16:02:34,080 [INFO] saving network parameters to /Users/mcruz/Documents/omni/models/cellpose_residual_on_style_on_concatenation_off_omni_nclasses_4_omni_2022_11_16_16_02_24.602007
Traceback (most recent call last):
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/omnipose/__main__.py", line 3, in <module>
    main(omni_CLI=True)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/__main__.py", line 476, in main
    cpmodel_path = model.train(images, labels, train_files=image_names,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/models.py", line 1045, in train
    model_path = self._train_net(train_data, train_labels,
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py", line 1057, in _train_net
    self.net.save_model(file_name)
  File "/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1504, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'save_model'  

Best,
Mario

@mccruz07 Thanks for the report! I have not tried training on apple silicon yet, but it looks like that might be a simple fix. I'll look into it in the next week.
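
For context, that AttributeError is the classic symptom of nn.DataParallel wrapping the network, so custom methods have to be reached through .module; a minimal illustration of the pattern (not the actual cellpose code, which may end up fixed differently):

import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
    def save_model(self, path):
        torch.save(self.state_dict(), path)

net = nn.DataParallel(TinyNet())
# net.save_model('weights.pt')  # AttributeError: 'DataParallel' object has no attribute 'save_model'
target = net.module if isinstance(net, nn.DataParallel) else net
target.save_model('weights.pt')  # unwrap first, then call the custom method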

@mccruz07 Turns out all GPU training was broken due to a recent change I made to fix a bug for CPU training. I fixed it now in cellpose-omni v0.7.3. I will test it on an M2 mac in the next couple days, but let me know if you get a chance to test it earlier.

@kevinjohncutler Thank you! But now I'm receiving the following error:

/Users/mcruz/opt/anaconda3/envs/omnipose/lib/python3.10/site-packages/cellpose/core.py:1111: UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications.

Update: with torch 1.13.1, training is working on Apple Silicon =D @mccruz07 I am still getting that warning, but no errors. A really rough benchmark on a small (5-image) dataset: the Titan RTX takes 34.1s for the first 100 epochs and the M2 GPU takes 95.7s, so about 2.8x slower. For reference, CPU training on my Ubuntu machine with a Core i9-9900K takes 161.3s (1.7x slower than the M2 GPU), and the M2 CPU takes 277.6s (2.9x slower than the M2 GPU).

I'm thinking about getting a Mac Studio to have much more VRAM than any consumer NVIDIA card can offer... @michaels10, what config do you have?

Hi -- I have a Mac M1 Ultra; that said, I just tried the GPU and it doesn't seem to be working for me. I get a rather uneventful
2023-01-12 17:59:42,705 [INFO] TORCH GPU version not installed/working. error message.

I used pip install -e . on the cloned repo. I'm running python -m cellpose --train --pretrained_model bact_omni --use_gpu --chan 0 --dir "/Users/michaelsandler/Documents/experiments/omnipose-test/training-data" --n_epochs 100 --learning_rate 0.1 --verbose. PyTorch is version 1.12, which I believe in their versioning scheme is greater than 1.4?

The minimal set to reproduce is the same as in the other bug.

Thanks @michaels10, good to know. 64 or 128GB of RAM? In addition to the cellpose_omni bug, which I hope is now fixed for you (it should now download v0.8.0 or higher), it's probably that your conda environment is not set up for PyTorch on M1 - you are right, torch 1.13.1 is what you want. I will update this repo with an environment file for Macs.
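
A quick way to confirm that an environment is really the arm64 build (and not an x86_64 one running under Rosetta) is to ask Python itself; a small check, nothing Omnipose-specific:

import platform
import torch

print(platform.machine())                  # 'arm64' for a native Apple Silicon interpreter; 'x86_64' under Rosetta
print(torch.__version__)                   # 1.13.1 or newer is what you want here
print(torch.backends.mps.is_available())   # should be True once the two lines above look right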

Ok, try out omnipose_mac_environment.yml. I installed it with

conda env create --name omnipose --file /Volumes/DataDrive/omnipose_mac_environment.yml
conda activate omnipose
pip install git+https://github.com/kevinjohncutler/omnipose.git
pip install git+https://github.com/kevinjohncutler/cellpose-omni.git

To my amazement, it worked the first time around. However, I have some notes from my first attempts at getting this to work months back, and it is possible that some dependencies actually need to be compiled from source and my conda environment is just picking up the right versions from the base environment... we shall see once more people try this out.

Tried it, alas, to no avail.**
PyTorch is version 1.13.1 -- also, my computer is the 128GB model.

The only potentially relevant warning I get is:

/Users/michaelsandler/opt/anaconda3/envs/omnipose/lib/python3.9/runpy.py:127: RuntimeWarning: 'cellpose_omni.__main__' found in sys.modules after import of package 'cellpose_omni', but prior to execution of 'cellpose_omni.__main__'; this may result in unpredictable behaviour

**Weirdly, someone in our lab with the exact same setup is having no problems. Maybe something is wrong with my base environment.

Interesting. I'm not sure what to make of that warning, but I know from practice that one way to be totally sure your environment is disjoint from base is to specify a different version of Python. Omnipose works on every version of Python I've tried so far (3.8.5+), so if your base is on 3.9, maybe you should try 3.10.8.

Update: after some extended sleuthing, I found that this was tied to Rosetta, which I had left enabled... whoops.
It works well now; I didn't even need to set MPS as a fallback!

That said, it does spit out the following warning:
UserWarning: The operator 'aten::linalg_vector_norm' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)

This is obviously an upstream issue, and I already get substantial speedup (5x), so I'm pretty happy with this. Is there anything you'd like me to test?

@michaels10 very nice! If you are using my environment file, I actually do set the environment variable PYTORCH_ENABLE_MPS_FALLBACK: '1'. I use torch.linalg.norm in 4 places, and maybe there is a workaround for the default 2-norm (just square and take the square root explicitly) to avoid that CPU fallback without waiting for another PyTorch release.
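
Concretely, the workaround would look something like this, mirroring the denom line quoted above with an explicit square/sum/sqrt so that aten::linalg_vector_norm is never dispatched (a sketch, not necessarily how it will land in cellpose-omni):

import torch

def l2_norm(t, dim=1):
    # Explicit 2-norm: square, sum, sqrt. Avoids aten::linalg_vector_norm,
    # which currently falls back to the CPU on the MPS backend.
    return torch.sqrt((t * t).sum(dim=dim))

eps = 1e-8
x, y = torch.randn(3, 5), torch.randn(3, 5)
denom = l2_norm(x) * l2_norm(y) + eps   # same role as the denom line in core.py
ref = torch.linalg.norm(x, dim=1) * torch.linalg.norm(y, dim=1) + eps
print(torch.allclose(denom, ref))       # True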

Thanks for offering! Have you tried any training so far, or just evaluation of existing models?

Update on GPU performance: I got my hands on a Mac Studio (M1 Ultra, 128GB) and it took 115.3s for the same 100-epoch test I ran earlier. I ran it again to make sure and got 97.8s, 93.1s, and then 93.8s. Not sure what explains the speed difference (njit compilation, perhaps).

I implemented the workaround for the vector_norm function like I said, and it did speed things up a bit. Times were 79.1s, 69.4s, and 67.7s over three trials. So this is roughly half as fast as the Titan RTX, but with over 5x the available memory.

Unfortunately, I just found out (while attempting to run a 3D model) that PyTorch's MPS backend does not currently support a number of basic 3D operations like conv3d. The whole reason I got the Mac Studio was to use Omnipose on memory-intensive 3D volumes. I'll either have to scrap that plan or implement those functions myself.

For evaluation, I should note that my Titan RTX takes about 0.4s on the default GUI image and the M1 takes about 0.6s, so roughly 2/3 as fast.
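
One stopgap I may try for the missing 3D ops is the CPU-fallback switch already set in the environment file; in a script it has to be set before torch is first imported, and whether conv3d falls back cleanly this way (and how slow it ends up being) remains to be seen:

import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'   # must be set before the first import of torch
import torch

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(device)   # unsupported MPS ops should now fall back to the CPU instead of raising an error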

Hi @kevinjohncutler, I'm using Apple Silicon with the same config as @michaels10, but when using omnipose_mac_environment.yml to create the environment, I get the following error:

(base) omnipose % conda env create --name omnipose_sil --file omnipose_mac_environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • sqlite==3.39.3=h2229b38_0
  • brotli==1.0.9=h1c322ee_7
  • nb_conda_kernels==2.3.1=py39h2804cbe_1
  • libllvm11==11.1.0=hfa12f05_4
  • liblapack==3.9.0=16_osxarm64_openblas
  • brotli-bin==1.0.9=h1c322ee_7
  • h5py==3.7.0=nompi_py39h6b51346_101
  • libzlib==1.2.12=h03a7124_3
  • jupyter_core==4.11.1=py39h2804cbe_0
  • tensorflow-base==2.10.0=cpu_py39h0d4f425_0
  • markupsafe==2.1.1=py39hb18efdd_1
  • giflib==5.2.1=h27ca646_2
  • bzip2==1.0.8=h3422bc3_4
  • libtiff==4.4.0=hfa0b094_4
  • cffi==1.15.1=py39h04d3946_0
  • icu==70.1=h6b3803e_0
  • zstd==1.5.2=h8128057_4
  • libpng==1.6.38=h76d750c_0
  • libzopfli==1.0.3=h9f76cd9_0
  • flatbuffers==2.0.7=hb7217d7_0
  • re2==2022.06.01=h9a09cb3_0
  • zlib-ng==2.0.6=he4db4b2_0
  • libwebp-base==1.2.4=h57fd34a_0
  • python==3.9.13=hc596b02_0_cpython
  • c-blosc2==2.4.2=h303ed30_0
  • libgfortran==5.0.0=11_3_0_hd922786_25
  • libaec==1.0.6=hbdafb3b_0
  • lcms2==2.12=had6a04f_0
  • libcxx==14.0.6=h2692d47_0
  • lz4-c==1.9.3=hbdafb3b_1
  • dav1d==1.0.0=he4db4b2_1
  • cfitsio==4.1.0=hd4f5c17_0
  • charls==2.3.4=hbdafb3b_0
  • llvm-openmp==14.0.4=hd125106_0
  • libffi==3.4.2=h3422bc3_5
  • ca-certificates==2022.9.24=h4653dfc_0
  • tensorflow-estimator==2.10.0=cpu_py39h63f9d84_0
  • fastremap==1.13.3=py39h4aae847_0
  • c-ares==1.18.1=h3422bc3_0
  • libavif==0.10.1=h3d80962_2
  • tornado==6.2=py39h9eb174b_0
  • aiohttp==3.8.3=py39h02fc5c5_0
  • libbrotlidec==1.0.9=h1c322ee_7
  • jpeg==9e=he4db4b2_2
  • openjpeg==2.5.0=h5d4e404_1
  • ncurses==6.3=h07bb92c_1
  • libnghttp2==1.47.0=h232270b_1
  • libsodium==1.0.18=h27ca646_1
  • imagecodecs==2022.9.26=py39h6bc43d6_0
  • click==8.1.3=py39h2804cbe_0
  • libprotobuf==3.21.7=hb5ab8b9_0
  • blosc==1.21.1=hd414afc_3
  • cryptography==38.0.1=py39haa0b8cc_0
  • pyzmq==24.0.1=py39h0553236_0
  • tensorflow==2.10.0=cpu_py39h2839aeb_0
  • importlib-metadata==4.11.4=py39h2804cbe_0
  • zeromq==4.3.4=hbdafb3b_1
  • openssl==1.1.1q=ha287fd2_0
  • libcblas==3.9.0=16_osxarm64_openblas
  • libopenblas==0.3.21=openmp_hc731615_3
  • libev==4.33=h642e427_1
  • libgfortran5==11.3.0=hdaf2cc0_25
  • lerc==4.0.0=h9a09cb3_0
  • brunsli==0.1=h9f76cd9_0
  • xz==5.2.6=h57fd34a_0
  • numba==0.56.2=py39h251cc7c_1
  • libbrotlicommon==1.0.9=h1c322ee_7
  • grpc-cpp==1.47.1=h503f348_6
  • zlib==1.2.12=h03a7124_3
  • tensorboard-data-server==0.6.0=py39hbe5e4b8_2
  • brotlipy==0.7.0=py39hb18efdd_1004
  • libcurl==7.85.0=hd538317_0
  • libdeflate==1.14=h1a8c8d9_0
  • libbrotlienc==1.0.9=h1c322ee_7
  • wrapt==1.14.1=py39h9eb174b_0
  • aom==3.5.0=h7ea286d_0
  • readline==8.1.2=h46ed386_0
  • libabseil==20220623.0=cxx17_h28b99d4_4
  • zfp==1.0.0=h7b19444_1
  • hdf5==1.12.2=nompi_h8968d4b_100
  • snappy==1.1.9=h39c3846_1
  • frozenlist==1.3.1=py39h4eb3d34_0
  • jxrlib==1.1=h27ca646_2
  • libsqlite==3.39.3=h76d750c_0
  • grpcio==1.47.1=py39h13431ec_6
  • libssh2==1.10.0=hb80f160_3
  • libblas==3.9.0=16_osxarm64_openblas
  • multidict==6.0.2=py39hb18efdd_1
  • krb5==1.19.3=hf9b2bbe_0
  • libedit==3.1.20191231=hc8eb9b7_2
  • tk==8.6.12=he1e0b03_0

I have tried everything listed here thus far, to no avail. Have you seen this before?

Quick update: I solved the issue above by simply using a prefix (indicating the arm64 architecture) with the conda env create command: CONDA_SUBDIR=osx-arm64 conda env create --name omnipose_sil --file omnipose_mac_environment.yml
and voilà, everything started working.
For the first time, I see the magical words: 2023-06-15 17:11:08,636 [INFO] ** TORCH GPU version installed and working. **
However, I'm having an issue with training using the following command:
python -m omnipose --train --pretrained_model None --use_gpu --chan 0 --dir /Users/saranshumale/Documents/Data/Asymmetry/April28MM/Cell1/BF_copy/ --n_epochs 100 --learning_rate 0.1

Error:
!NEW LOGGING SETUP! To see cellpose progress, set --verbose
No --verbose => no progress or info printed
2023-06-15 17:11:08,636 [INFO] ** TORCH GPU version installed and working. **
2023-06-15 17:11:08,636 [INFO] >>>> using GPU
Traceback (most recent call last):
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/saranshumale/Documents/omnipose/omnipose/__main__.py", line 3, in <module>
    main(omni_CLI=True)
  File "/opt/anaconda3/envs/omnipose_sil/lib/python3.9/site-packages/cellpose_omni/__main__.py", line 254, in main
    if args.nchan>1:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

@su2804 Sorry I never saw there was activity on this thread. Are you still experiencing that training issue?
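
If so, the immediate cause is that args.nchan is None when the CLI compares it to an int; a guard of this shape (illustrative, not the actual cellpose_omni code) is the kind of fix needed on my end, and explicitly passing a channel count on the command line (if your version exposes an --nchan flag) may work around it in the meantime:

def resolve_nchan(nchan, default=1):
    # Treat a missing nchan as the default instead of comparing None to an int.
    return default if nchan is None else nchan

print(resolve_nchan(None) > 1)   # False, no TypeError
print(resolve_nchan(2) > 1)      # True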