google-deepmind/dqn_zoo

Make Docker Container Publicly Available?

I hate to ask this, but can you make the Docker container publicly available? I spent a few hours today trying to build it myself and I kept running into issues.

jqdm commented

I'm afraid it's not practical to share the image due to its size. The Dockerfile should be sufficient assuming the prerequisites detailed in the quick start are satisfied.

Those prerequisites are quite difficult to meet. It turns out installing NVIDIA Docker is error-prone and risky: I spent hours undoing the damage I did trying to install it. Even if the image is large, you could share it on Google Drive. Surely it can't be larger than 10 GB?

jqdm commented

NVIDIA Docker is needed to run the container so it still needs to be installed. NVIDIA Docker is widely used and well supported so you may be able to find help with installing it. Can you describe what you are trying to achieve and what your set up is? In particular do you have a GPU on the machine you are trying to run this on?

I was hoping to spare you these details, but here goes: my research institute has a SLURM-based compute cluster (with accompanying GPUs) that I intend to run this on. The problem is that I don't have superuser privileges, so I can't install Docker, NVIDIA Docker, or sudoless Docker on the cluster without resorting to a virtual machine. I've since started trying that direction using Vagrant, but I ran into issues that one of the cluster managers is helping me with. Also, our cluster won't run Docker, so after I get the Docker image, I need to convert it to a Singularity image.

The other alternative is to install Docker and NVIDIA Docker on my personal machine, which does not have a GPU. While a GPU is not a prerequisite to install NVIDIA Docker (or NVIDIA drivers), the installation instructions confusingly conflate installing NVIDIA drivers with installing CUDA drivers. This is what screwed me up this past weekend.

Also, for some reason, after I installed NVIDIA Docker, my personal machine's fan started malfunctioning - it activated every 1-3 seconds, even when no processes were running. Once I uninstalled NVIDIA Docker, my fan continued to malfunction. I had to purge NVIDIA from my personal machine and only then did my fan return to normal.

I'm also not familiar with any of these tools (NVIDIA Docker, NVIDIA drivers, Vagrant), so if you can suggest a simple, direct solution, I would be tremendously appreciative.

What happens if I just try building the Dockerfile without installing NVIDIA Docker or NVIDIA drivers?

jqdm commented

Given there isn't a GPU on your personal machine, I would avoid going down the Docker + NVIDIA Docker route. As the README.md says, using Docker is not a hard requirement. The Dockerfile is there mainly to provide a self-contained, reproducible example of how to run a DQN Zoo agent on a machine with a GPU, as this is the main use case.

On your personal machine, I would install the dependencies with pip, using requirements.txt as a guide. In particular, exclude the dependencies labelled as transitive, and instead of installing jaxlib with GPU support, use jaxlib==0.1.50 or similar, as only CPU support is needed.
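The filtering described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the thread does not show how requirements.txt labels its transitive dependencies, so the sketch assumes an inline comment containing the word "transitive", and the function name `cpu_only_requirements` and the sample package versions are hypothetical.

```python
def cpu_only_requirements(lines, marker="transitive", cpu_jaxlib="jaxlib==0.1.50"):
    """Return requirement lines for a CPU-only install.

    Assumption: transitive dependencies carry an inline comment that
    contains `marker`. Adjust this to match the real requirements.txt.
    """
    kept = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-only lines
        requirement, _, comment = line.partition("#")
        if marker in comment.lower():
            continue  # drop anything labelled as transitive
        requirement = requirement.strip()
        if "jaxlib" in requirement.lower():
            requirement = cpu_jaxlib  # replace a GPU build with the CPU pin
        kept.append(requirement)
    return kept


# Hypothetical sample input; the real file will list different packages.
sample = [
    "absl-py==0.9.0",
    "six==1.15.0  # transitive",
    "jaxlib==0.1.50+cuda101",
]
print("\n".join(cpu_only_requirements(sample)))
```

The printed lines could be written to a separate file and passed to `pip install -r`, which leaves the original requirements.txt untouched.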

I did what you suggested a week ago and was successfully able to run the code locally on my CPU. What I've been stuck with for the past week is how to run the code on my school's compute cluster using its GPUs. How do I do that?

To clarify, as I mentioned above, the cluster doesn't permit Docker. Rather, it encourages Singularity, which is allegedly better for shared computing resources. That means I have two options: (1) build the Docker image locally and then transfer it to the cluster and convert it to a Singularity image, or (2) create a VM on the cluster, build the Docker image there, then convert it to a Singularity image.

I tried (1) and was able to successfully build the Docker image locally, without having NVIDIA docker or NVIDIA drivers installed. However, when I created a tarball, transferred it to the cluster, extracted it, converted it to a Singularity image and tried running it, I got the error:

ERROR: /bin/sh does not exist in container

I don't know how to tell if that error originates from the Docker image or from the conversion to a Singularity image or something else.
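One way to narrow this down is to inspect the tarball before converting it. The sketch below rests on an assumption that this thread does not confirm: an archive written by `docker save` holds image layers plus a top-level `manifest.json`, whereas `docker export` writes a container's flat root filesystem, which is where a `bin/sh` entry would appear. The function names are hypothetical.

```python
import tarfile


def _normalize(name):
    """Strip a leading './' and any trailing '/' from a tar member name."""
    while name.startswith("./"):
        name = name[2:]
    return name.rstrip("/")


def classify_members(names):
    """Guess what kind of archive a list of tar member names came from."""
    normalized = {_normalize(name) for name in names}
    # Some distributions symlink /bin to /usr/bin, so accept either path.
    has_shell = "bin/sh" in normalized or "usr/bin/sh" in normalized
    if has_shell:
        return "root-filesystem"
    if "manifest.json" in normalized:
        return "image-archive"
    return "unknown"


def classify_tarball(path):
    """Open a tarball on disk and classify its contents."""
    with tarfile.open(path) as archive:
        return classify_members(archive.getnames())
```

If `classify_tarball` reports `"image-archive"`, the missing `/bin/sh` may come from treating a layered image archive as a root filesystem rather than from the Docker build itself.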

One clarifying question: are NVIDIA Docker and NVIDIA drivers necessary for building the Docker image, for running it, or for both?

jqdm commented

I would not persist in trying to use Docker if the cluster does not allow it. As I mentioned, using Docker is not a hard requirement and even if Singularity supports conversion from Docker, I can imagine it would be hard to debug if something goes wrong as you're discovering. It would be simpler for you to build a Singularity image that contains the DQN Zoo code and required dependencies.

Ok I'll try that. I'm sure I'll be back with more questions :)

jqdm commented

Closing this one as it relates to making the Docker image available; other issues specific to DQN Zoo can be raised separately.

@jqdm I found that building the Docker image without NVIDIA Docker and running it on a machine with CUDA works fine. If I can suggest a change to the README, you should distinguish between requirements to build and requirements to run.
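The build/run distinction can be made concrete with a small check, sketched here under assumptions: the grouping of tools into "build" and "run" is inferred from the comment above rather than taken from the README, and `nvidia-smi` is used only as a sign that an NVIDIA driver is installed on the machine.

```python
import shutil

# Assumed grouping, based on the observation in this thread that the
# image builds without NVIDIA Docker but needs a CUDA-capable host to run.
BUILD_TOOLS = ("docker",)
RUN_TOOLS = ("nvidia-smi",)


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


print("missing for build:", missing_tools(BUILD_TOOLS))
print("missing for run:", missing_tools(RUN_TOOLS))
```

Running this on the build machine and on the target machine separately shows which side lacks which prerequisite.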