HewlettPackard/swarm-learning

How to use GPU in version 1.1.0

h-ahmad opened this issue · 3 comments

I couldn't figure out how to use the GPU with the `run-sl` command either, but I was able to use the GPU by using SWOP.

I added the following Docker GPU options to the SWOP Profile.

      usrcontaineropts:
        - gpus: "all"

or something like this.

      usrcontaineropts:
        - gpus: "device=6,7"

The document I referred to is below.

https://github.com/HewlettPackard/swarm-learning/blob/master/docs/User/Frequently_asked_questions.md#-how-do-you-run-swarm-learning-on-gpu
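Before wiring these options into SWOP, it can help to confirm that plain Docker on the host can see the GPUs at all. The sketch below only builds the equivalent `docker run` command line. The helper names are hypothetical, the image tag is an assumption (any CUDA-enabled image should do), and it assumes the NVIDIA Container Toolkit is already installed on the host, since Docker's `--gpus` flag depends on it. It does not claim to reproduce how SWOP itself passes `usrcontaineropts` to Docker.

```python
import shlex


def gpus_flag_value(spec):
    """Return the value to pass to Docker's --gpus flag.

    Docker parses this value as CSV, so a device list that contains a
    comma has to be wrapped in literal double quotes.
    """
    return f'"{spec}"' if "," in spec else spec


def docker_gpu_check_cmd(gpus="all", image="nvidia/cuda:10.2-devel-ubuntu18.04"):
    """Build (but do not run) a docker command that runs nvidia-smi in a container.

    The image name is an assumption made for illustration.
    """
    return ["docker", "run", "--rm", "--gpus", gpus_flag_value(gpus), image, "nvidia-smi"]


if __name__ == "__main__":
    # Print shell-ready commands for the two option values shown above.
    for spec in ("all", "device=6,7"):
        print(shlex.join(docker_gpu_check_cmd(spec)))
```

If the printed command lists the expected GPUs when run on the host, the Docker side is working and any remaining problem is more likely in the user container image or the SWOP profile.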

Originally posted by @IMOKURI in #101 (comment)

I am able to run a model on the GPU (Nvidia GeForce 3090) without Docker or Swarm Learning; I have installed and configured CUDA with PyTorch. How can I use the GPU in the Swarm Learning environment? Do I need to install the drivers again for Docker? How do I add the options mentioned above to the SWOP profile? Thanks.

Hi @h-ahmad,

The user container has to be built to support GPU access.

For TensorFlow: the tensorflow-gpu installation comes with the required GPU libraries, so it takes care of the CUDA dependencies.
- FROM tensorflow/tensorflow:2.8.0-gpu

For PyTorch: you need to install the CUDA dependencies explicitly, so start from a CUDA image.
- FROM nvidia/cuda:10.2-devel-ubuntu18.04

Start the user container from a CUDA image that supports your host CUDA version, then install python3 and other packages as needed.

Before enabling Swarm Learning, run the user container with SWARM_LOOPBACK set to True and verify that local training can access the GPUs. Once that succeeds, remove SWARM_LOOPBACK so that training uses the Swarm callback.
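One way to do that verification is a small check at the top of the training script. This is a minimal sketch rather than part of Swarm Learning: the function name is hypothetical, and reading SWARM_LOOPBACK from the environment is an assumption about how the variable reaches the container. It only reports what PyTorch can see and falls back gracefully when PyTorch is not installed.

```python
import os


def describe_gpu_access():
    """Report whether PyTorch inside this container can see any CUDA devices."""
    info = {
        # Assumption: SWARM_LOOPBACK is passed to the container as an env variable.
        "loopback": os.environ.get("SWARM_LOOPBACK", ""),
        "torch_installed": False,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing from the image; nothing more to probe.
        return info

    info["torch_installed"] = True
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_gpu_access())
```

If `cuda_available` is false in loopback mode, the problem lies in the image or the Docker GPU options rather than in Swarm Learning itself.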

Hope this helps.

For your reference, the following are the build steps of my user container, which support the host CUDA drivers.

- FROM nvidia/cuda:10.2-devel-ubuntu18.04
- ' '
- RUN apt-get update && apt-get install -y apt-transport-https software-properties-common
- ' '
- RUN add-apt-repository -y ppa:deadsnakes/ppa
- ' '
- RUN apt-get install -y python3.8
- ' '
- RUN rm /usr/bin/python3
- ' '
- RUN ln -s /usr/bin/python3.8 /usr/bin/python3
- ' '
- RUN apt-get install -y python3-pip
- ' '
- RUN pip3 install --upgrade pip
- ' '
- RUN pip3 install imutils opencv-python pandas pillow service_identity sklearn networkx
- ' '
- RUN pip3 install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
- ' '
- RUN mkdir -p /tmp/hpe-swarmcli-pkg
- COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl  /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
- RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl

Closing this issue as there are no follow-up questions. @h-ahmad, please reopen it if you have further questions.