Unable to use megamolbart model

Question

muammar opened this issue 3 years ago · 8 comments

I tried following the instructions shown in the megamolbart/README, but that does not work for me:

--(Wed Apr 06|15:25 [master]$)- ./launch.sh dev 2
sourcing environment from ./.env
+ local CONTAINER_OPTION=2
+ local CONT=nvcr.io/nvstaging/clara/cheminformatics_demo:latest
+ [[ 2 -eq 2 ]]
+ DOCKER_CMD='docker run     --rm     --network host     --runtime=nvidia     -p :8888     -p 9001:9001     -p 5000:5000     -v /home/muammar/git/cheminformatics:/workspace     -v /home/muammar/git/cheminformatics/data/data:/data     -u 1000:1000     --shm-size=1g     --ulimit memlock=-1     --ulimit stack=67108864     -e HOME=/workspace     -e TF_CPP_MIN_LOG_LEVEL=3     -w /workspace -v /home/muammar/git/cheminformatics/megamolbart/models:/models/megamolbart/'
+ DOCKER_CMD='docker run     --rm     --network host     --runtime=nvidia     -p :8888     -p 9001:9001     -p 5000:5000     -v /home/muammar/git/cheminformatics:/workspace     -v /home/muammar/git/cheminformatics/data/data:/data     -u 1000:1000     --shm-size=1g     --ulimit memlock=-1     --ulimit stack=67108864     -e HOME=/workspace     -e TF_CPP_MIN_LOG_LEVEL=3     -w /workspace -v /home/muammar/git/cheminformatics/megamolbart/models:/models/megamolbart/ -w /workspace/megamolbart/'
+ CONT=nvcr.io/nvstaging/clara/megamolbart:latest
+ docker run --rm --network host --runtime=nvidia -p :8888 -p 9001:9001 -p 5000:5000 -v /home/muammar/git/cheminformatics:/workspace -v /home/muammar/git/cheminformatics/data/data:/data -u 1000:1000 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -e HOME=/workspace -e TF_CPP_MIN_LOG_LEVEL=3 -w /workspace -v /home/muammar/git/cheminformatics/megamolbart/models:/models/megamolbart/ -w /workspace/megamolbart/ -it nvcr.io/nvstaging/clara/megamolbart:latest bash
WARNING: Published ports are discarded when using host network mode

=============
== PyTorch ==
=============

NVIDIA Release 20.11 (build 17345815)
PyTorch Version 1.8.0a0+17f8c32

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

(base) bash-4.4$

Once inside the container shell, I run:

(base) bash-4.4$ python launch.py &
[1] 54
(base) bash-4.4$ INFO:megamolbart:Maximum decoded sequence length is set to 512
INFO:megamolbart:Triggering model download...
Downloading model megamolbart to /models/megamolbart...
++ wget -q --show-progress --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/clara/megamolbart/versions/0.1/zip -O /models/megamolbart/megamolbart_0.1.zip
/models/megamolbart/megamolbart_0.1.zip: Permission denied
++ mkdir /models/megamolbart
mkdir: cannot create directory ‘/models/megamolbart’: File exists
++ unzip -q /models/megamolbart/megamolbart_0.1.zip -d /models/megamolbart
unzip:  cannot find or open /models/megamolbart/megamolbart_0.1.zip, /models/megamolbart/megamolbart_0.1.zip.zip or /models/megamolbart/megamolbart_0.1.zip.ZIP.
INFO:megamolbart:Model download result: None
INFO:megamolbart:Model download result: None
Traceback (most recent call last):
  File "launch.py", line 98, in <module>
    main()
  File "launch.py", line 94, in main
    Launcher()
  File "launch.py", line 71, in __init__
    self.download_megamolbart_model()
  File "launch.py", line 92, in download_megamolbart_model
    raise Exception('Error downloading model')
Exception: Error downloading model

The user created in the container does not have permission to write to /models/megamolbart. I want to feed some SMILES strings to MegaMolBART and generate embeddings. How can I do that? I would appreciate any help. Thanks.
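One way to confirm this diagnosis from inside the container is to attempt an actual write to the mount point. The following is a minimal Python sketch and not part of the repository: the helper name is_writable_dir is hypothetical, and /models/megamolbart is simply the path taken from the log above.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if the current user can create a file inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # Attempt a real write instead of relying on permission bits alone.
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

Calling is_writable_dir("/models/megamolbart") as the container user (uid 1000 in the docker run line above) would be expected to return False while the permission problem is present.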

Answer 1 · 2022-04-11T14:07:22.000Z

I solved the permission problem by running chmod -R 777 cheminformatics/megamolbart/models. I also had to apply the following patch:

diff --git a/launch.sh b/launch.sh
index a77b46d..2011768 100755
--- a/launch.sh
+++ b/launch.sh
@@ -161,7 +161,7 @@ dev() {
         DOCKER_CMD="${DOCKER_CMD} -w /workspace/cuchem/"
     fi

-    ${DOCKER_CMD} -it ${CONT} bash
+    ${DOCKER_CMD} -it --gpus all --privileged -v /dev:/dev ${CONT} bash

     exit
 }
@@ -263,4 +263,4 @@ case $1 in
     *)
         usage
         ;;
-esac
\ No newline at end of file
+esac
diff --git a/megamolbart/launch.py b/megamolbart/launch.py
index a41621b..d6e2acd 100755
--- a/megamolbart/launch.py
+++ b/megamolbart/launch.py
@@ -47,7 +47,7 @@ class Launcher(object):
         parser.add_argument('-p', '--port',
                             dest='port',
                             type=int,
-                            default=50051,
+                            default=50055,
                             help='GRPC server Port')
         parser.add_argument('-l', '--max_decode_length',
                             dest='max_decode_length',

However, now that I am inside the container, I have no further instructions on how to generate embeddings. Any help would be highly appreciated.

Best,

Answer 2 · 2022-04-21T19:23:45.000Z

@muammar I am working on a patch and adding troubleshooting notes to the README.

Answer 3 · 2022-04-21T20:21:47.000Z

Once inside the container, please use the following command to start the service:

cd /opt/nvidia/megamolbart && python3 launch.py

Answer 4 · 2022-04-21T20:50:46.000Z

Thanks for your reply. I had already done this, and I also appended & to python3 launch.py to keep it running in the background. From there, how am I supposed to use smiles2embedding()?

Answer 5 · 2022-04-21T21:25:07.000Z

Once the service is started, it can be accessed through a gRPC interface. Please refer to the script https://github.com/NVIDIA/cheminformatics/blob/master/misc/generate_mols.py

The simplest gRPC code to access these functions is shown below.

import generativesampler_pb2
import generativesampler_pb2_grpc
host = 'http://192.167.100.2'
with grpc.insecure_channel('192.167.100.2:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    spec = generativesampler_pb2.GenerativeSpec(
                model=generativesampler_pb2.GenerativeModel.MegaMolBART,
                smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
                radius=0.0001,
                numRequested=10)
    response = stub.FindSimilars(spec)

Please refer to https://ngc.nvidia.com/containers/nvidia:clara:megamolbart for the functions exposed by this service.
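Before opening the channel, it can help to confirm that the service port is reachable at all, since a wrong host or port otherwise only shows up as a gRPC error. Below is a minimal standard-library sketch, assuming the host and port from the snippet above. port_is_open is a hypothetical helper, and the port has to match whatever launch.py was started with (its -p/--port option).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open('192.167.100.2', 50051) can be run from the client side before creating the gRPC channel. A False result points to a networking or port mismatch and not to a problem in the request itself.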

Answer 6 · 2022-05-02T20:23:18.000Z

Following the website you suggested:

The first command does not work:

muammar@ussdgw-mw414 /tmp
  % ngc registry model download-version "nvidia/clara/megamolbart:0.1.2"
Error: 'nvidia/clara/megamolbart:0.1.2' could not be found.

Note: using :latest does not work either. So I proceeded to remove the tag:

muammar@ussdgw-mw414 /tmp
  % ngc registry model download-version "nvidia/clara/megamolbart"
No version specified, downloading latest version: '0.1'.
Downloaded 119.38 MB in 22s, Download speed: 5.42 MB/s
----------------------------------------------------
Transfer id: megamolbart_v0.1 Download status: Completed.
Downloaded local path: /tmp/megamolbart_v0-1.1
Total files downloaded: 12
Total downloaded size: 119.38 MB
Started at: 2022-05-02 16:16:35.972564
Completed at: 2022-05-02 16:16:58.011783
Duration taken: 22s
-----------------------------

Now, I go to the next step:

muammar@ussdgw-mw414 /tmp
  % docker run \
--gpus all \
--rm --privileged -v /dev:/dev \
-v $(pwd)/megamolbart_v0.1/:/models/megamolbart \
nvcr.io/nvidia/clara/megamolbart:0.1.2

That seems to work:

  weight_decay .................... 0.01
  world_size ...................... 1
  zero_allgather_bucket_size ...... 0.0
  zero_contigious_gradients ....... False
  zero_reduce_bucket_size ......... 0.0
  zero_reduce_scatter ............. False
  zero_stage ...................... 1.0
---------------- end of arguments ----------------
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
global rank 0 is loading checkpoint /models/megamolbart/checkpoints/iter_0134000/mp_rank_00/model_optim_rng.pt
could not find arguments in the checkpoint ...
INFO:megamolbart.service:Loaded iteration 134000

I go into the running container:

muammar@ussdgw-mw414 /tmp
  % docker exec -it 97bc85eafa5b bash
root@97bc85eafa5b:/workspace#

Then I try to run the command:

python -m grpc_tools.protoc -I./grpc/ \
 --<>_out=generated \
 --experimental_allow_proto3_optional \
 --grpc_python_out=generated \
 ./grpc/generativesampler.proto

root@97bc85eafa5b:/workspace# python -m grpc_tools.protoc -I./grpc/ \
>  --<>_out=generated \
>  --experimental_allow_proto3_optional \
>  --grpc_python_out=generated \
>  ./grpc/generativesampler.proto
/opt/conda/bin/python: Error while finding module specification for 'grpc_tools.protoc' (ModuleNotFoundError: No module named 'grpc_tools')

Finally, I tried your script above:

root@97bc85eafa5b:/workspace# cat example.py
import generativesampler_pb2
import generativesampler_pb2_grpc
host = 'http://192.167.100.2'
with grpc.insecure_channel('192.167.100.2:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    spec = generativesampler_pb2.GenerativeSpec(
                model=generativesampler_pb2.GenerativeModel.MegaMolBART,
                smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
                radius=0.0001,
                numRequested=10)
    response = stub.FindSimilars(spec)
root@97bc85eafa5b:/workspace#

root@97bc85eafa5b:/workspace# python example.py
Traceback (most recent call last):
  File "example.py", line 4, in <module>
    with grpc.insecure_channel('192.167.100.2:50051') as channel:
NameError: name 'grpc' is not defined

I understand the error occurs because the grpc module is never imported in the script, but the module does not exist in the container either. So the instructions are incomplete :(.
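To tell apart "not imported" from "not installed", the availability of each module can be checked without importing it. This is a small sketch using only the standard library. missing_modules is a hypothetical helper, and the module names in the usage note are the ones that appear in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Running missing_modules(["grpc", "grpc_tools", "generativesampler_pb2", "generativesampler_pb2_grpc"]) inside the container lists which of these would still have to be installed or generated before the example script can work.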

I see this report is related to #154

Answer 7 · 2022-05-03T15:25:48.000Z

If you are working from source, please follow these steps.

Terminal 1

./launch.sh start
This will download pre-reqs and start the application.

Terminal 2

The default IP address of the MegaMolBART container is '192.168.100.2'. To confirm the actual IP address, run the following command:

docker inspect cheminformatics_megamolbart_1 | grep IPv4Address

This IP address is used in the last step.

./launch.sh dev 1
This will place you in the container, in a mode that is generally useful for advanced usage and development.
conda activate rapids
You might need to initialize conda (conda init bash) before running this command.
ipython3

import grpc
import generativesampler_pb2
import generativesampler_pb2_grpc
host = '192.168.100.2'
with grpc.insecure_channel(f'{host}:50051') as channel:
    stub = generativesampler_pb2_grpc.GenerativeSamplerStub(channel)
    spec = generativesampler_pb2.GenerativeSpec(
                model=generativesampler_pb2.GenerativeModel.MegaMolBART,
                smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
                radius=0.0001,
                numRequested=10)
    response = stub.FindSimilars(spec)
    
print(response.generatedSmiles)
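The docker inspect step from Terminal 2 can also be scripted so that the host value does not have to be copied by hand. The sketch below makes some assumptions: it parses the JSON that docker inspect prints, prefers a statically configured address (IPAMConfig.IPv4Address) and falls back to the runtime IPAddress field. The function names are hypothetical, and the container name is the one used earlier in this answer.

```python
import json
import subprocess


def inspect_container(name):
    """Run `docker inspect <name>` and return its JSON output as text."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def container_ips(inspect_json):
    """Map each attached network name to the container's IPv4 address."""
    ips = {}
    for container in json.loads(inspect_json):
        networks = container.get("NetworkSettings", {}).get("Networks", {})
        for net_name, net in networks.items():
            # Static address if one was configured, else the runtime address.
            ipam = net.get("IPAMConfig") or {}
            ip = ipam.get("IPv4Address") or net.get("IPAddress")
            if ip:
                ips[net_name] = ip
    return ips
```

With the service running, container_ips(inspect_container("cheminformatics_megamolbart_1")) returns a dictionary whose values can be used as host in the gRPC snippet.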

Answer 8 · 2022-05-16T18:08:42.000Z

For me, the Terminal 1 step (./launch.sh start) always fails with the following:

± % ./launch.sh start
sourcing environment from ./.env
WARNING: The UID variable is not set. Defaulting to a blank string.
WARNING: The GID variable is not set. Defaulting to a blank string.
Removing cheminformatics_cuchemUI_1
Removing cheminformatics_megamolbart_1
Recreating ce1241ffea5d_cheminformatics_megamolbart_1 ... error
Recreating c5031e3709ad_cheminformatics_cuchemUI_1    ...

ERROR: for ce1241ffea5d_cheminformatics_megamolbart_1  no such image: sha256:59e665c69516585ac612edc22c1eca93e165833b40d5ed25a5597c9c8223e4b5: No such image: sha2
Recreating c5031e3709ad_cheminformatics_cuchemUI_1    ... error

ERROR: for c5031e3709ad_cheminformatics_cuchemUI_1  no such image: sha256:9640cadcd6412f6ee679f1c15718b2525710358c6e7a3653cd20e57f815e3d96: No such image: sha256:9640cadcd6412f6ee679f1c15718b2525710358c6e7a3653cd20e57f815e3d96

ERROR: for megamolbart  no such image: sha256:59e665c69516585ac612edc22c1eca93e165833b40d5ed25a5597c9c8223e4b5: No such image: sha256:59e665c69516585ac612edc22c1eca93e165833b40d5ed25a5597c9c8223e4b5

ERROR: for cuchemUI  no such image: sha256:9640cadcd6412f6ee679f1c15718b2525710358c6e7a3653cd20e57f815e3d96: No such image: sha256:9640cadcd6412f6ee679f1c15718b2525710358c6e7a3653cd20e57f815e3d96
ERROR: The image for the service you're trying to recreate has been removed. If you continue, volume data could be lost. Consider backing up your data before continuing.

Recreating c5031e3709ad_cheminformatics_cuchemUI_1    ... error
Recreating ce1241ffea5d_cheminformatics_megamolbart_1 ... error
Recreating c5031e3709ad_cheminformatics_cuchemUI_1    ...

ERROR: for c5031e3709ad_cheminformatics_cuchemUI_1  Cannot start service cuchemUI: Invalid address 10.59.7.62: It does not belong to any of this network's subnets

ERROR: for ce1241ffea5d_cheminformatics_megamolbart_1  Cannot start service megamolbart: Invalid address 10.59.7.62: It does not belong to any of this network's subnets

ERROR: for cuchemUI  Cannot start service cuchemUI: Invalid address 10.59.7.62: It does not belong to any of this network's subnets

ERROR: for megamolbart  Cannot start service megamolbart: Invalid address 10.59.7.62: It does not belong to any of this network's subnets
ERROR: Encountered errors while bringing up the project.

I have tried setting the subnet, but it never works.
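The "Invalid address ... does not belong to any of this network's subnets" message indicates that the static IP requested for the container lies outside the subnet of the Docker network it is attached to. A quick sanity check with Python's ipaddress module is sketched below. The subnet 192.168.100.0/24 is an assumption, chosen only because the default container IP mentioned earlier in the thread is 192.168.100.2, so substitute the subnet actually configured for the network.

```python
import ipaddress


def in_subnet(address, subnet):
    """Return True if `address` falls inside `subnet` (CIDR notation)."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(subnet)


# Address from the error message, checked against the assumed subnet.
print(in_subnet("10.59.7.62", "192.168.100.0/24"))
# Default container address from earlier in the thread.
print(in_subnet("192.168.100.2", "192.168.100.0/24"))
```

If the first check is False for the configured subnet, either the address assigned to the services or the subnet definition needs to change so that the two agree.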