No Nonsense: Tensorflow GPU setup

1. Introduction

Setting up a GPU for machine learning with tensorflow can be a maze. This is a simple setup guide that links all the steps with a template that can be used for a quick setup. If you want a more detailed guide on what everything is doing check out my related blog post at https://leondebnath.com/no-nonsense-tensorflow-gpu.html

Note: I've tried to link all the relevant documentation, so if Tensorflow, Docker or NVIDIA update their processes, you know where to find them!

2. NVIDIA Drivers (test step)

This guide assumes you have a linux machine with a GPU with drivers installed. Different distros have varying support and drivers available, I have personally found Pop!_OS to work very well as they provide an ISO with NVIDIA support inbuilt, for Ubuntu, the apt repository holds drivers for most modern GPUs. When you have completed the installation, you can test the setup using the command

nvidia-smi

and should get a window output like this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:2B:00.0  On |                  N/A |
|  0%   44C    P8              16W / 170W |    909MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2986      G   /usr/lib/xorg/Xorg                          280MiB |
|    0   N/A  N/A      3099      G   /usr/bin/gnome-shell                         56MiB |
|    0   N/A  N/A      6074      G   firefox                                     558MiB |
|    0   N/A  N/A    185656      G   ...ures=SpareRendererForSitePerProcess        2MiB |
+---------------------------------------------------------------------------------------+

3. Install Docker

If you already have docker installed on your machine, skip to section 4. Otherwise, follow the official installation instructions for your distro here: https://docs.docker.com/engine/install/ubuntu/

Note: although some distros allow you to install using a repository (such as APT for debian based distros) it is strongly recommended that you use the official method

3.1 Post installation

If you hate having to type sudo before every docker command, follow the post install steps to add docker to your user group.

4. NVIDIA Container Toolkit

The container toolkit allows the container to talk to the GPU, follow the instructions: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html provided by NVIDIA. You can test your setup using the command:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

If this runs on your machine, then you are ready to try and run tensorflow!

5. Clone this repository

git clone https://github.com/S010MON/tensorflow-gpu

6. Run Docker Compose

Navigate to the top level directory of the repo you just cloned, and run:

docker compose up

This should start the process of downloading the image and building the container. If you have an older version of docker, you may need to use a hypen in the command docker-compose up. It may take some time for the first set-up, but will be much faster next time as all the steps are cached by docker and only the final changes are re-run.

6.1. Jupyter Notebooks

You should end up with this, use the links to access the notebooks

tensorflow-gpu  | [I 12:14:38.101 NotebookApp] Serving notebooks from local directory: /tf
tensorflow-gpu  | [I 12:14:38.101 NotebookApp] Jupyter Notebook 6.5.3 is running at:
tensorflow-gpu  | [I 12:14:38.101 NotebookApp] http://e81742778f1c:8888/?token=4e10a373c039e1a178f9c688ad4c504fad3d9bcfc48cd831
tensorflow-gpu  | [I 12:14:38.101 NotebookApp]  or http://127.0.0.1:8888/?token=4e10a373c039e1a178f9c688ad4c504fad3d9bcfc48cd831
tensorflow-gpu  | [I 12:14:38.101 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
tensorflow-gpu  | [C 12:14:38.103 NotebookApp] 
tensorflow-gpu  |     
tensorflow-gpu  |     To access the notebook, open this file in a browser:
tensorflow-gpu  |         file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
tensorflow-gpu  |     Or copy and paste one of these URLs:
tensorflow-gpu  |         http://e81742778f1c:8888/?token=4e10a373c039e1a178f9c688ad4c504fad3d9bcfc48cd831
tensorflow-gpu  |      or http://127.0.0.1:8888/?token=4e10a373c039e1a178f9c688ad4c504fad3d9bcfc48cd831

6.2 Python Scripts

Sometimes it's more convenient to run python scripts, this can be done by attaching to the container in the terminal (you will need to keep the container above running and do this in a new terminal):

Running docker ps -a will list all of your running containers:

CONTAINER ID   IMAGE                    COMMAND                  CREATED          STATUS                      PORTS                                       NAMES
35f098e8e62f   tensorflow-gpu-jupyter   "bash -c 'source /et…"   14 seconds ago   Up 13 seconds               0.0.0.0:8888->8888/tcp, :::8888->8888/tcp   tensorflow-gpu

to execute commands in the container use the command:

docker exec -it [CONTAINER_NAME] bash

for example, the default name for this template is tensorflow-gpu so the command becomes:

docker exec -it  tensorflow-gpu bash

you should see this message when you attach:

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@35f098e8e62f:/tf/notebooks#

Now you're ready to train on the GPU